ABSTRACT

The designer of a concurrency control is currently faced with a confusing situation: there
has been a proliferation of proposed methods, but the performance of various methods
under differing systems and applications is unknown. Furthermore, the optimal method for a
given system might vary with system usage; in fact, some methods could later prove
incompatible with desired system changes unforeseen at design time.


One way to approach these problems is to avoid commitment to any one method. This can
be done by separating policy from correctness in the design of the concurrency control.
This approach allows the concurrency control method to easily be modified to optimize
desired performance characteristics, or to satisfy unforeseen performance criteria. Here, this
approach has been successfully used for the case in which concurrency is hidden from
record management and application programs. This case is defined more precisely by
presenting a high-level framework for transaction processing system design.

This framework is suitable for uniprocessor, multiprocessor, or distributed systems. In this
framework there are two subsystems, a concurrency control and a global memory manager,
that are responsible for controlling access to all shared data objects. The concurrency
control and global memory manager are application-independent, and are essentially
autonomous. They may be functionally distributed between two processors, or distributed
among many processors by partitioning the set of shared data objects. Since access to all
shared data objects is controlled by these two subsystems, all other subsystems may be
distributed in any fashion desired. In particular, in computer networks all other subsystems
may be replicated.

A general paradigm for concurrency control is developed in which transactions are not
required to follow any given protocol; instead, possible conflicts are detected and a policy
determines how the possible conflict is handled. Thus, the two-phase locking protocol is just
one of many possible policies; the optimistic method for concurrency control can be
implemented by another policy. The paradigm is designed so that regardless of the decisions
made by the policy the database remains consistent. Policies may be defined at design-time,
or they may be defined by modules that are executed at run-time. In the latter case, it is
possible to dynamically change policies, even while the system is in use. This is shown to
be highly convenient for policy development and experimentation.

The global memory manager designs support multi-version objects, which are used to solve
the granularity problem for queries (read-only transactions) by making it unnecessary for
queries to interact with the concurrency control. Some properties of the global memory
manager designs are that new versions of objects may be written asynchronously by [...]
at the earliest point at which it can be guaranteed that no future transaction or query will
access that version.

A transaction processing system was developed for Cm*, a distributed multi-microprocessor.
This system used a concurrency control in which all
functions required by the paradigm were available, and the concurrency control used a
policy module implementing a number of policies, any of which could be chosen at run time.
The record manager of this system supported a simple relational view of the database, with
one or more B-tree indexes for each relation. The record manager was replicated, and
copies of versions of shared data objects were cached. The usefulness of the design
framework was illustrated by the fact that the record manager, the most complex subsystem,
was earlier developed on a completely different centralized system, and required only minor
modifications to be used in this system.

Experiments were performed with varying numbers of processors, and with various policies.
For this system, the effect of waiting due to locking proved to be negligible, and so locking
policies generally gave the best performance. This result may not apply to other systems,
though: as an example of one of many possible differences, in this system individual
processors were not multiprogrammed. However, using a concurrency control policy
module, it is easy to investigate many different concurrency controls on any system, as
demonstrated by these experiments.
1. Introduction

In this introduction the nature of transaction processing will be examined, followed by an
identification of some of the problems unique to transaction processing. Then, after stating
the problems of concern here, this work and its relationship to previous work will be
summarized.

1.1. What is Transaction Processing?

The  nature  of  transaction  processing  systems  can  best  be  illustrated  by  considering  a 
number  of  examples.  Some  examples  of  commercial  transaction  processing  systems  include 
banking,  airline  reservation,  and  inventory  control  systems;  some  office  automation  examples 
include  memo  and  appointment  scheduling  systems;  an  example  of  software  engineering 
support  is  a  documentation  system  where  the  documentation  of  modules  is  added  or  edited 
as  the  modules  are  developed;  finally,  some  general  information  system  examples  include 
bulletin  boards,  shared  bibliographies,  and  personnel  directories. 

In all of these examples there is an underlying database that is shared by a number of users.
Thus, the problem of transaction processing system design is an extension of the more
general problem of database design, and all problems of database design are also problems
in transaction processing system design. The distinguishing property of transaction
processing systems is that any of the users sharing the database can in principle modify the
database.

Since the database is shared, users must be restricted in the manner in which they are
allowed to modify the database; otherwise the database could easily become internally
inconsistent (e.g., damaged access structures) and unusable, or externally inconsistent (e.g.,
containing information not in agreement with reality) and unreliable.

A formal definition of internal consistency is implementation dependent. For example, if the
database is structured as a tree, then the property that the graph formed by the nodes and
pointers in this structure actually be a tree is an internal consistency property. In general,
internal consistency properties are properties that can be determined to hold or not by
examination only of the information in the database. The problem of preserving internal
consistency leads naturally to the notion of a transaction: given a formal definition of the
shared database and its internal consistency, a transaction can be formally defined as a
procedure that modifies the database in such a fashion that if the database is internally
consistent before the transaction is executed, then the database is internally consistent after
the transaction has completed. By restricting users to modifying the database using only
procedures that are strongly believed to be transactions, either through testing or
correctness proofs, the problem of preserving internal consistency is partly solved -- the
additional problems of protection, recovery, and concurrency are discussed below.
Henceforth calling a procedure that is strongly believed to be a transaction simply a
transaction, some examples of transactions are:

banking -- deposit or withdraw funds to or from an account, transfer funds from one
account to another;

airlines -- make or cancel a reservation;

inventory control -- add or subtract from the quantity on hand of a certain item;

memo system -- send a memo to a list of recipients, remove a memo from a
collection of memos;

appointment scheduling -- try to schedule a meeting for a set of people in a given
time period, cancel a previously scheduled meeting;

documentation -- add, edit, or delete documentation for a new, modified, or
obsolete module;

bulletin boards -- post, correct, or remove a new, incorrect, or old bulletin;

shared bibliographies -- add or edit a new or incorrect description of a bibliographic
entry;

personnel directories -- add, edit, or delete information for a given person.

A procedure that accesses the database but does not modify the database is called a query.
Some examples of queries are:

banking -- retrieve the current balance in an account or set of accounts, generate
an account statement (history);

airlines -- retrieve the number of available seats on a given flight, find all flights or
sequences of flights with given departure and destination points;

inventory control -- find the quantity on hand of a certain item, find the total value
of all stock;

memo system -- read a memo, find all memos on a given subject;
In each case, a transaction corresponds to an event that has taken place, is taking place, or
could take place in the "real world," and a query corresponds to [...] the
"real world," as illustrated in Figure 1.1. Note that in all of the above informal descriptions
[...]
is accessed by the transaction. This is probably because the transactions model events in
the "real world," and in the real world, there is a locality principle: objects cannot in
general affect each other unless they are "close", and few objects can in general be
simultaneously "close". In any case, in most transaction processing systems all transactions
seem to be small in this sense. However, it is easy to imagine useful queries of all sizes, as
should be clear from the above examples. These general properties of queries and
transactions have implications for transaction processing system design, and will be
discussed below.
Figure 1.1. Transaction Processing


Design of Concurrency Controls for Transaction Processing Systems


The problem of guaranteeing external consistency is usually approached by protection
mechanisms -- the goal is to allow each user to modify only those parts of the database for
which that user can be trusted or assumed, first, to know the true state of the corresponding
part of the "real world", and second, to enter this information correctly. This is the
transaction side of protection; there is also a query side of protection, where the goal is to
allow each user to access in a query only those parts of the database which that user
has a right to examine.

1.2.  Problems  of  Transaction  Processing 

Ignoring for the moment problems of database design, those problems introduced by
transaction processing will now be considered. Two problems of transaction processing --
increasing confidence that a procedure is in fact a transaction, and protection -- have
already been mentioned above. However, these problems are not unique to transaction
processing: the first problem can be seen as an instance of the more general problem of
program verification and testing, and the problem of designing protection mechanisms -- a
problem for any system with shared resources -- does not seem significantly changed by the
nature of transaction processing (however, protection policies may be more complex, as in
statistical databases).

Another problem is that of recovery. The database can become internally inconsistent due
to hardware failures or software errors; it can also become externally inconsistent due to
human errors. In such cases it is desirable to restore the database to some earlier state that
is believed to be consistent, relying on the hopefully increasing reliability of the lower levels
of memory hierarchies for recovery of the previous state. However, it is also desirable to
undo as little as possible, that is, to restore the database to the most recent such state. This
latter goal can be very important in some transaction processing applications (e.g., for
economic reasons), thus in a sense making the problem unique to transaction processing,
even though it is a desirable goal for any system.

The  final  problem  is  that  of  concurrency:  even  if  transactions  individually  preserve  internal 
consistency,  concurrent  execution  of  transactions  may  cause  internal  consistency  to  be  lost, 
as  shown  by  the  following  simple  example. 

The database consists of four integer variables X, Y, Z, and W. Furthermore,
internal consistency requires that X + Y + Z + W = 4. If currently X = 1, Y = 3, Z =
W = 0, consider the following interleaved execution of two transactions A and B (Z
and W are unused here, but will be referred to below). Note that temp is a local
variable for each transaction.
      Transaction A          Transaction B
Step  move 1 from X to Y     move 1 from Y to X     tempA  tempB   X  Y  Z  W

                                                                   1  3  0  0
(1)   temp := X                                       1            1  3  0  0
(2)                          temp := Y                1      3     1  3  0  0
(3)                          Y := temp - 1            1      3     1  2  0  0
(4)                          temp := X                1      1     1  2  0  0
(5)   X := temp - 1                                   1      1     0  2  0  0
(6)   temp := Y                                       2      1     0  2  0  0
(7)   Y := temp + 1                                   2      1     0  3  0  0
(8)                          X := temp + 1            2      1     2  3  0  0

Clearly, each transaction individually preserves X+Y+Z+W = 4, but the result is that
X+Y+Z+W = 5. A transaction processing subsystem that prevents or resolves interactions
such as this will be called a concurrency control.
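
The interleaving above can be replayed mechanically. The following is a minimal sketch,
assuming the eight steps are assigned to A and B as the tabulated values imply (steps 1,
5, 6, 7 to A; steps 2, 3, 4, 8 to B); the function names are illustrative and do not come
from the original text.

```python
def move_one(db, source, target):
    """One transaction run in isolation: move 1 from source to target."""
    temp = db[source]
    db[source] = temp - 1
    temp = db[target]
    db[target] = temp + 1


def run_serial():
    """A followed by B, with no interleaving."""
    db = {"X": 1, "Y": 3, "Z": 0, "W": 0}
    move_one(db, "X", "Y")  # transaction A
    move_one(db, "Y", "X")  # transaction B
    return db


def run_interleaved():
    """The eight-step interleaving of A and B from the example."""
    db = {"X": 1, "Y": 3, "Z": 0, "W": 0}
    temp_a = db["X"]       # (1) A: temp := X
    temp_b = db["Y"]       # (2) B: temp := Y
    db["Y"] = temp_b - 1   # (3) B: Y := temp - 1
    temp_b = db["X"]       # (4) B: temp := X (A has read X but not yet written it)
    db["X"] = temp_a - 1   # (5) A: X := temp - 1
    temp_a = db["Y"]       # (6) A: temp := Y
    db["Y"] = temp_a + 1   # (7) A: Y := temp + 1
    db["X"] = temp_b + 1   # (8) B: X := temp + 1 (overwrites A's update of X)
    return db


if __name__ == "__main__":
    print(sum(run_serial().values()), sum(run_interleaved().values()))
```

The serial run leaves the sum at 4, while the interleaved run ends with X = 2 and Y = 3,
because step (8) writes a value computed from the copy of X read before step (5).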

At  first  this  problem  may  not  seem  significantly  different  from  the  general  synchronization 
problem  (see  [Andter  79]  for  a  survey).  However,  there  is  a  fundamental  difference:  in 
general, the objects that will be accessed by a transaction cannot be determined in advance
of the actual accesses. This is in contrast to an operating system, say, where the shared
data structures accessed by a particular module are usually determined at design time. The
problem is more closely related to that of allocation of shared resources, with no prior
claiming of resources. This similarity has led historically to locking-style concurrency
controls, that is, concurrency controls in which access to an object is in some cases
restricted to at most one transaction. However, there are many more concurrency controls
than locking concurrency controls (for example, see Chapter 5). The difference is that
concurrent access to an object need not be disastrous, as it usually would be if it were to a
tape drive, for example.

The reason that accesses cannot be predicted in advance is that the actions taken by a
transaction are in general dependent on the data read. This can be true at all levels of the
system. For example, at the conceptual level (see Chapter 2), it is easy to imagine
transactions of the form: "for all X, Y satisfying certain constraints, if X < Y then update X,
otherwise update Y".

An example at the physical access level is the use of dynamic index structures (see
Appendix I). In such cases, the set of objects that will be accessed by a transaction, other
than the root of the index, is completely unknown prior to execution (see Figure 1.2).
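
As a small illustration of an access path that is only determined at execution time, the
sketch below walks a binary search tree and records which nodes are touched. It is an
assumption-laden stand-in: a plain binary tree is used instead of the dynamic index
structures of Appendix I, and the names Node and lookup are hypothetical.

```python
class Node:
    """A node of a simple binary search tree standing in for an index page."""

    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def lookup(root, key):
    """Search for key; return (found, keys of the nodes visited, in order)."""
    path = []
    node = root
    while node is not None:
        path.append(node.key)
        if key == node.key:
            return True, path
        # Which child is read next depends on the data just read.
        node = node.left if key < node.key else node.right
    return False, path


if __name__ == "__main__":
    index = Node(50, Node(25, Node(10), Node(30)), Node(75))
    print(lookup(index, 30))
    print(lookup(index, 75))
```

Two lookups share only the root; every other object accessed is discovered by reading the
previous one, so the access set cannot be declared before the search begins.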
Figure 1.2. Access Path Determined at Execution-Time


Since accesses cannot in general be predicted in advance, it will be necessary to abort
transactions (i.e., undo all modifications to the shared database). Consider the example of
transactions A and B above. Until step (4) is reached, it just as well could have been the
case that A would access X and Z only and that B would access Y and W only. However,
once step (4) is reached, one of the two transactions will eventually have to be aborted.
Also, it is desirable to allow users or application programs to abort transactions.

These, then, are some of the problems introduced by transaction processing. However,
solutions to these problems must take into account the underlying database system. Some
of the inadequacies in existing database systems result from needs to process (1) more data
for (2) more users, (3) more quickly, (4) more reliably, and (5) less expensively. One current
approach to these inadequacies relies on the rapidly decreasing cost of processing power.
Apparently, though, the cost of large, reliable memory and various special devices (such as
high-quality printers) is not decreasing nearly so rapidly. These economic considerations
have led to architectures such as that of Figure 1.3, in which (cheap) processing power is
distributed, but (expensive) large, reliable memory and special devices are shared.

The architecture of Figure 1.3 seems ideally suited for essentially non-shared, personal
applications; the problem introduced by transaction processing in this case is to use this
architecture effectively in a highly-shared application. For example, solving the concurrency
control problem by executing transactions sequentially (batch processing) would in general
be unacceptable since only the central computer would be utilized to any degree at all (this
is often true for uniprocessor systems as well, where there may actually be parallel
processing due to multiple I/O devices).

Figure 1.3. Personal Computer Network with Shared Expensive Devices

1.3.  Problems  Considered  Here 

The major problem considered here is that of concurrency control design, subject to two
constraints. The first constraint is that the concurrency control should be as application
independent as possible. The advantages of this are overwhelming, and are more fully
discussed in Chapter 2.

The second constraint is one of efficiency. The architecture of Figure 1.3 is just an example;
the concurrency control should apply to other architectures as well. A problem is that a
particular concurrency control could be efficient for some architectures and not for others.
Even given the architecture, there are many variables, such as [...]
bandwidths, probability of transaction conflict, average transaction and query size, etc. The
second constraint, then, is that the concurrency control should be general: all concurrency
controls, subject to certain explicit design criteria (such as application independence),
should be implementable -- there should be no hidden design criteria. This results in a design
that can be [...] for a specific environment by fixing the remaining design criteria [...]
Another problem considered here is that of query size. As noted above, queries can be
arbitrarily  large.  But  queries  do  not  modify  the  database;  it  seems  that  it  should  be  possible 
to  run  queries  without  any  concurrency  control  interaction.  In  fact  this  is  the  case,  and 
several  solutions  will  be  presented,  but  the  solutions  rely  on  multi-version  objects.  This  then 
raises  the  problem  of  garbage  collecting  old  versions,  which  is  considered  here  as  well. 
Multi-version  objects  are  also  of  use  in  distributed  systems  for  determining  when  cached 
copies  of  shared  objects  are  "out-of-date",  and  they  are  generally  useful  for  recovery 
purposes. 

A  related  problem  is  one  of  transaction  size.  Even  if  most  transactions  are  small,  there  may 
be  some  occasional  large  transactions  (e.g.,  a  transaction  that  physically  reorganizes  the 
database).  Furthermore,  poor  database  design  can  result  in  conceptually  small  transactions 
physically  accessing  large  parts  of  the  database.  This  is  known  as  the  granularity  problem, 
and  can  be  solved  by  hierarchical  concurrency  controls,  such  as  that  discussed  in  [Gray  78]. 
Another  approach  is  to  attempt  to  design  the  database  so  that  large  transactions  are  rare. 
This  is  the  approach  assumed  here,  and  for  this  reason,  earlier  work  related  to  one  aspect  of 
this  problem  -  design  of  dynamic  index  structures  -  is  presented  in  Appendix  I.  This  is  also 
a  problem  in  design  at  higher  levels:  an  example  often  used  to  illustrate  a  large  transaction 
is  one  that  raises  all  salaries  in  an  employee  file  by  5%,  say  -  however,  if  salaries  were  not 
stored  in  absolute  terms,  but  instead  were  stored  as  relative  values,  with  a  single  data  item 
giving  the  conversion,  this  apparently  large  transaction  becomes  very  small.  Furthermore, 
this  design  can  be  generalized  if  necessary  to  contain  a  number  of  conversion  factors  for 
different  classes  of  employees.  This  example  is  mentioned  only  to  show  that  the  necessity 
for large transactions is not always clear. In any case, even if there are large transactions,
in most applications they will be rare, and it is trivial to generalize non-hierarchical
concurrency controls to simple hierarchical concurrency controls -- for example, an option
can be added so that transactions can request that the entire database be locked. Finally,
the  effective  use  of  hierarchical  concurrency  controls  requires  advance  knowledge  of  the 
behavior  of  some  transactions,  contrary  to  the  first  constraint  above.  For  these  reasons 
hierarchical  concurrency  controls  are  not  considered  here. 
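
The salary example can be made concrete with a short sketch. It assumes salaries are
stored as relative values together with one shared conversion factor, as the paragraph
above describes; the class name Payroll and its methods are hypothetical and chosen only
for illustration.

```python
class Payroll:
    """Salaries stored as relative values plus a single conversion factor."""

    def __init__(self, relative_salaries):
        self.relative = dict(relative_salaries)  # one data item per employee
        self.factor = 1.0                        # the single conversion item

    def salary(self, name):
        """Absolute salary, derived on demand from the relative value."""
        return self.relative[name] * self.factor

    def raise_all(self, percent):
        """Raise every salary by writing exactly one data item."""
        self.factor *= 1 + percent / 100.0


if __name__ == "__main__":
    payroll = Payroll({"ann": 1000.0, "bob": 2000.0})
    payroll.raise_all(5)
    print(payroll.salary("ann"), payroll.salary("bob"))
```

Under this layout the "raise all salaries" transaction touches one object instead of the
whole employee file, at the price of a multiplication on every read. Several factors, one
per class of employee, would follow the same pattern.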

1.4. Summary of this Work

The problems of transaction processing considered here have been described above. A goal
of this thesis is to solve these problems in such a fashion that the designs will be usable in a
wide variety of applications and systems. This problem is made more difficult by the fact that
[...]
an initially acceptable design could later prove to be unacceptable if the design assumptions
are violated. Furthermore, this difficulty is becoming ever larger as the varieties and uses of
multiprocessor and computer network systems grow. Due to the added complexity of new
systems, first, there are in general many more [...]
and second, it is much more difficult to predict [...]
One well-known approach to this problem is information-hiding -- by isolating design
decisions, and decomposing the system into modules that hide these decisions, a much
more flexible design results. In Chapter 2 a design framework for transaction processing
systems is developed based on information-hiding principles. This framework has two
subsystems, a concurrency control and a memory manager, that control access to all shared
data objects (it is assumed that in distributed systems the memory manager is decomposed
into a number of local memory managers, and a global memory manager). Due to these
subsystems, the framework can be used in a wide variety of systems, some examples of
which are given. The properties the concurrency control must satisfy are also developed in
Chapter 2 using a formal model that has sufficient detail to be immediately applicable in
practice.

In Chapter 3 an overview of a system using the decomposition of Chapter 2 is presented.
The  design  decisions  that  are  common  to  various  subsystems  are  discussed,  and  the 
communication  protocols  between  subsystems  are  given. 

Another approach to the generality problem may be called correctness/policy separation.
When designing a module to solve a given problem, decisions are made at several levels,
ranging from the level of fundamental correctness to the level of pure policy. For example, in
the case of concurrency control, the module must be designed so that internal inconsistency
of the database is never allowed due to interactions between concurrent transactions -- this
is a fundamental correctness property. It is also desirable in some applications that when
transactions conflict, the transaction with the earlier starting time be given priority -- clearly a
policy. By separating these levels of decisions, and then applying information-hiding, a
general design results. In Chapter 4 a design paradigm for concurrency controls is
developed using only those assumptions that are necessary for correctness, application
independence, and practicality. This paradigm has the property that regardless of the
decisions made by any particular policy, the concurrency control will remain fundamentally
correct.
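
The separation can be sketched in a few lines. The following is a deliberately simplified
illustration, not the paradigm of Chapter 4: it assumes exclusive access to each object
until a transaction finishes, and every name in it (Decision, ConcurrencyControl,
older_wins, never_wait) is hypothetical. The mechanism detects a possible conflict and
enforces the invariant; the policy only chooses among outcomes that all preserve it.

```python
from enum import Enum


class Decision(Enum):
    WAIT = 1             # requester is refused for now and may retry
    ABORT_REQUESTER = 2
    ABORT_HOLDER = 3


class ConcurrencyControl:
    """Mechanism: at most one active transaction holds any object."""

    def __init__(self, policy):
        self.policy = policy   # may be replaced while the system is running
        self.holder = {}       # object name -> id of the holding transaction
        self.aborted = set()

    def request(self, txn, obj):
        """Return True if txn may access obj now."""
        if txn in self.aborted:
            return False
        other = self.holder.get(obj)
        if other is None or other == txn:
            self.holder[obj] = txn
            return True
        # Possible conflict detected: the policy decides how it is handled.
        decision = self.policy(txn, other, obj)
        if decision is Decision.ABORT_HOLDER:
            self.finish(other, aborted=True)
            self.holder[obj] = txn
            return True
        if decision is Decision.ABORT_REQUESTER:
            self.finish(txn, aborted=True)
        return False

    def finish(self, txn, aborted=False):
        """Release everything held by txn (on commit or abort)."""
        for obj in [o for o, t in self.holder.items() if t == txn]:
            del self.holder[obj]
        if aborted:
            self.aborted.add(txn)


def older_wins(requester, holder, obj):
    """Policy: ids are start times; the earlier transaction has priority."""
    return Decision.ABORT_HOLDER if requester < holder else Decision.WAIT


def never_wait(requester, holder, obj):
    """Policy: a conflicting requester is always aborted."""
    return Decision.ABORT_REQUESTER
```

Whatever a policy function returns, two active transactions never hold the same object,
so swapping `cc.policy` at run time changes performance characteristics only.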

Policies may be defined statically at design-time, or they may be defined dynamically by
policy modules that are executed at run-time. This latter approach makes it very easy to
experiment with policies, since policies can be changed even while the system is in use.
Furthermore, this makes possible a new area of research in concurrency control design: that
of designing concurrency controls that dynamically adapt to system usage so as to optimize
performance.

The property that the concurrency control remains fundamentally correct regardless of the
policy is also useful for designing and maintaining policies, since the policy designer has a
high degree of freedom. As an illustration of this freedom, in Chapter 5 a set of basic
policies is developed in which all transactions are treated uniformly without priority: the
result is 330 distinct policies. These policies should be considered only as a beginning to
the study of policies, since in practice there are a variety of extensions that will prove
valuable. Two extensions that are often necessary, deadlock detection and queuing of
requests, are described.

In  Chapter  6  several  designs  for  a  global  memory  manager  are  developed.  This  subsystem 
supports  multi-version  objects,  and  is  used  to  solve  the  granularity  problem  for  queries  by 
making it unnecessary for queries to interact with the concurrency control. A design for
garbage collection, in which old versions of objects are deleted at the earliest point at which
it can be guaranteed that they will never again be accessed, is also presented.

A  complete  transaction  processing  system  using  these  designs  was  developed  for  the  Cm* 
distributed  multi-microprocessor.  This  system  used  a  concurrency  control  in  which  all 
functions required by the paradigm were available, and the concurrency control used a
policy module implementing all basic policies, any of which could be chosen at run-time.
Record managers were replicated, and copies of shared data objects were cached. The
throughput limitations of this system were investigated, and experiments were performed
using several policies. This system and the experiments are described in Chapter 7. A
generalization of the record manager used in this system is given in Appendix I, and
algorithms developed for the concurrency control are given in Appendix II. These algorithms
should  apply  directly  to  any  transaction  processing  system  using  the  framework  of  Chapter  2. 

Chapter  8  contains  conclusions  and  a  discussion  of  further  research. 

1.5. Relationship to Previous Work

In early work on the concurrency control problem, the two-phase locking protocol was
developed ([Eswaran et al 76], [Stearns et al 78]). An implicit assumption behind two-phase
locking is that transactions should be controlled so as to prevent aborts if at all possible.
Starting from a different premise, that transactions should never wait for access to an object,
a radically different optimistic method for concurrency control was developed in [Kung and
Robinson 81]. As an example of the difference between the two approaches, in an optimistic
method deadlock will never occur (since transactions never wait), and so deadlock detection
is unnecessary; on the other hand, using an optimistic method, transactions are much more
likely to be aborted. What the performance differences would be between the two
approaches in any given application was unknown. This suggested the problem of designing
a more general concurrency control that would be capable of both methods; in an
application, experiments could easily be conducted (since both methods are fundamentally
correct in the same sense), and the "best" method could then be used. This problem was
the starting point for this thesis, generalized as follows: the general concurrency control
should be capable of any method that satisfied certain explicit assumptions. Of those
assumptions, the primary one is that the concurrency control must guarantee serializability
(see Section 2.5); the remaining assumptions have to do with application independence and
practicality (see Section 4J).

no  prococoi  concurrency  convoi  con  no  0099000  wokhw  roooro  vor  mo  jvmuono  prwn, 
but as discussed above, this seems to be a significant problem only for queries. This
problem is nicely solved using multi-version objects, a solution that is especially appropriate for the
concurrency control design developed here, since serializability is guaranteed in an explicitly
known order (see Section 4.5 for a discussion on why this is necessary). The use of
multi-version objects in distributed database systems has previously been investigated in
[Reed 78]. The major differences between their use by Reed and their use here are: (1) as used
by Reed, version numbers (pseudo-times in Reed's terminology) are determined during the
course of a transaction, whereas here, sequentially numbering the successful commits of
transactions, the version numbers of the objects written by a transaction are the same as its
commit number; (2) due to the distributed nature of version number assignment, Reed must
largely avoid the garbage collection problem, and queries require concurrency control
support, whereas here version numbers are managed by conceptually centralized entities,
with the results that garbage collection algorithms can be developed, and that queries do not
require concurrency control support.

The mapping of the transaction processing system framework of Chapter 2 to a distributed
architecture was influenced by the Medusa operating system (a general-purpose operating
system for Cm* - see [Ousterhout et al 80]), and by the work of Garcia-Molina (see
[Garcia-Molina 79]). If Medusa were extended to be a database operating system (in the sense of
[Gray 78]), using the mapping of Chapter 2, the global memory manager and concurrency
control would be seen as new utilities, and the local memory manager would be seen as a
new type of kernel. An alternative design would be to include the concurrency control and
global memory manager as part of a kernel (a copy of which runs on every node) - however,
Garcia-Molina simulated a variety of centralized and distributed concurrency controls, and
found that the centralized concurrency controls in most cases gave better performance.

Finally, the literature for concurrency control has now grown very large, as can be seen from
the recent survey [Bernstein and Goodman 81]. Although much of the previous work on
concurrency control has had a strong influence here, as Bernstein and Goodman conclude,
all the various designs can be seen as combinations and variations of a few basic
techniques. A major difference of the approach here is that fundamental correctness has
been separated from policy. This approach was influenced by the design of the Hydra
operating system (see [Wulf et al 74]), and by the work of Everhart [Everhart 79].

2. Design Principles

An overall structure for transaction processing systems will now be described, based on
information-hiding/data-independence principles. This is followed by a formal development
of the notions of serializability and conflicts, forming a basis for concurrency control design.

2.1. On the Criteria to be Used in Decomposing (Transaction
Processing)  Systems  into  Modules 

The title of this section refers, of course, to the paper by Parnas [Parnas 72]. As discussed
there and elsewhere, one approach to system design is information-hiding, i.e., the
decomposition of the system into modules based on difficult or changeable design decisions,
with each module designed so as to hide one decision from the others. This approach has
been demonstrated to have great advantages in decreasing system development and
maintenance time, and in increasing confidence in system correctness. These advantages
are widely believed to outweigh any resultant efficiency disadvantages, if in fact any such
disadvantages do result.

The information-hiding principle, when applied to database design, has come to be known as
data-independence, which for example Date defines as "the immunity of applications to
change in storage structure and access strategy" [Date 77].

The most far-reaching decisions in database design are those of choice of data models.
This is comparable to the more general problem of choice of data structures in
programming. A problem in database design is that there are many data models that are
variously most appropriate depending on the level at which the design is approached, ranging
from the physical access level, where one would naturally like to think of a "data item" as a
disk block or segment, to the user interface level, where one might think of a "data item" as
a record of some type, or as a collection of records of the same type (e.g., a relation), or
perhaps as some kind of entity suitable for display on a CRT.

This is a common problem of large systems, and can be solved by abstraction. In database
design, separation of the design decisions on data models at three levels of abstraction -
the physical access (internal) level, the logical access (conceptual) level, and the user
access (external) level - has led to the so-called ANSI/SPARC three-level DBMS (database
management system) architecture (see [Jardine 77] for a presentation and discussions of this
architecture).

This architecture has the property that various different external data models may be
simultaneously supported. Similarly, different internal data models may be supported. Thus,
this architecture lends itself to data-independence - the conceptual model provides a fixed
interface between changeable application programs (i.e., external/conceptual mappings) and
changeable storage-organization/access programs (i.e., conceptual/internal mappings).

In summary, this three-level architecture is shown in Figure 2.1. Here, a user interface
is a collection of modules providing a single external/conceptual mapping, and the record
manager is a collection of modules that define the conceptual model and its mapping to
physical storage.

Figure 2.1. Three-Level Architecture

2.2.  Problems  of  the  Three-Level  Architecture 

The three-level architecture seems to deal adequately with the problem of isolating
application programs from storage structures and access strategies, providing one is willing
to accept the essentially static nature of the conceptual model. However, it provides no
such  isolation  at  the  physical  access  level  (which  could  either  be  at  the  level  of  physical 
storage  devices,  or  at  the  level  of  virtual  storage  as  seen  through  a  host  operating  system). 
Typically,  the  record  manager  has  detailed  knowledge  of  available  storage  and  its 
characteristics,  in  addition  to  implementing  concurrency  control  and  recovery. 

Apparently this has not been a significant problem in the past, as far as system correctness
is concerned, since there are reliable transaction processing systems in existence. However,
it has probably contributed significantly to the development time for these systems. Ideally,
one would like to take existing record management software, or new software, and use it in a
system without regard to the underlying machine architecture. Also, one might want to
change record management policies as system usage changes (as part of system
maintenance), implement a new record manager, or correct bugs in an existing record
manager, without regard to possible interactions with concurrency control or recovery that
could cause these subsystems to become incorrect.

These problems of the three-level architecture will become ever more significant given the
increasingly complex environments provided by multiprocessors, distributed
computer networks, and memory hierarchies. In fact, system correctness could very well

Figure 2.2. Four-Level Architecture


2.3. A Four-Level Architecture

Separating record management design into two levels - the virtual and physical levels -
results in a general architecture like that of Figure 2.2. Here, the physical internal model
refers to the hardware itself (possibly seen through a general-purpose operating system), and
the virtual internal model refers to a data model based on a collection of abstract data
objects, and in which all details of concurrency, memory hierarchy, and recovery are
hidden.

The data objects supported at the virtual internal level could be of varying complexity,
depending on the system being considered; examples include pages,
segments, and records (see Section 3.2). Referring to Figure 2.2, the record manager is
responsible for mapping the conceptual model to a collection of such data
objects, accessing an object using only the operations defined on that object. The memory
manager is responsible for mapping these objects to physical storage.

Figure 2.3. Centralized System


2.4.  Example  Systems 

The  four-level  architecture  can  be  mapped  to  a  variety  of  architectures  in  a  variety  of  ways. 
Three examples will now be given.

2.4.1. A Centralized System

The  four-level  architecture  can  be  used  in  a  straightforward  way  in  a  centralized  system,  as 
shown in Figure 2.3. Note that in this system, record managers can share code, and that the
record managers, memory manager, and recovery manager can share buffer space. This
could just as well be a tightly coupled multiprocessor system, that is, a multiprocessor
system in which access to shared memory is equally inexpensive for all processors.

2.4.2.  A  Personal  Computer  Network  System 

One possible mapping of the four-level architecture to a computer network is
shown in Figure 2.4. In this system there is a more complex memory hierarchy: there are a

number of local memories and a large shared memory. One approach to this
problem is functional distribution; in this approach, access

Figure 2.5. Multi-Microprocessor System (legend: processor, module(s), process, message interface, device interface, procedure call interface)

memory  managers  at  each  processor,  each  managing  all  local  memories  as  if  this  were  one 
large  shared  resource.  The  same  approach  could  be  applied  to  the  concurrency  control  or 
to the recovery manager. Some disadvantages of this approach are: (1) the required
process communication is much more extensive; (2) problems such as distributed deadlock
seem very difficult to handle efficiently; and (3) system correctness is in general more in
doubt, due to the added complexity of the system. In fact, in the case of concurrency
control, simulation studies by Garcia-Molina (see [Garcia-Molina 79]) have shown that in a
variety of cases, centralized concurrency controls are much more efficient than comparable
distributed concurrency controls.

2.4.3. A Multi-Microprocessor System

A  system  using  the  designs  of  Chapters  3-6  was  implemented  on  Cm*,  a  multi- 
microprocessor  system  (see  Chapter  7).  A  proposed  decomposition  for  multi-microprocessor 
transaction  processing  systems  is  shown  in  Figure  2.5.  This  was  the  decomposition  used  in 
the  implementation,  except  that  the  user  interfaces  were  combined  in  a  single  master 
process  that  simulated  the  transactions  produced  by  a  number  of  users,  and  an  extremely 
simple recovery manager was implemented as part of the global memory manager. The
decomposition  of  Figure  2.5  is  essentially  the  decomposition  of  Figure  2.4;  however, 
address-space  limitations  have  resulted  in  some  further  decomposition,  as  shown. 

In  the  case  that  the  concurrency  control  or  global  memory  manager  become  system 
bottlenecks,  it  is  possible  to  distribute  these  among  several  processors  without  sacrificing 
functional  decomposition  by  partitioning  the  set  of  shared  data  objects  (see  Sections  4.6  and 
6.5). 

2.5.  Serializability  and  Conflicts 

In  this  final  section  the  problem  of  concurrency  control  design  is  addressed.  The  goal  here 
is  to  obtain  a  formal  characterization  of  the  kinds  of  interactions  under  concurrency  that  can 
lead  to  loss  of  consistency,  assuming  only  that  each  transaction  individually  preserves 
consistency. 

Let the following constants and sets be given:

k, the number of transactions in the system;

O, the set of object IDs;

D, the set of object states (all possible data parts of an object);

T, the set of transaction states, including, in particular, a halting state;

R, W, read and write symbols (arbitrary, different constants).

First,  versions  of  objects  and  database  states  will  be  defined. 

Definition. A version is a triple <o, v, d>, where o ∈ O, v is an integer, 0 ≤ v ≤ k,
and d ∈ D. In the triple <o, v, d>, o, v, and d are called the object ID, version
number, and data, respectively. A database state is a set of versions satisfying:

(1) for every o ∈ O, there is a version with a zero version number, <o, 0, d>;

(2) for all object IDs o and version numbers v, there is at most one version
with ID o and version number v.
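
As a minimal sketch of this definition, a database state can be modelled in Python as a set of triples and the two conditions checked directly. The names Version and is_database_state are chosen here for illustration only and do not appear in the original development.

```python
from typing import Hashable, Iterable, NamedTuple, Set


class Version(NamedTuple):
    oid: Hashable   # object ID, an element of O
    vnum: int       # version number, 0 <= vnum <= k
    data: Hashable  # object state, an element of D


def is_database_state(versions: Set[Version],
                      object_ids: Iterable[Hashable],
                      k: int) -> bool:
    """Check conditions (1) and (2) of the definition (illustrative helper)."""
    ids = set(object_ids)
    seen = set()
    for ver in versions:
        # Every version must name a known object and a legal version number.
        if ver.oid not in ids or not (0 <= ver.vnum <= k):
            return False
        key = (ver.oid, ver.vnum)
        if key in seen:   # condition (2): at most one version per (o, v)
            return False
        seen.add(key)
    # Condition (1): every object ID has a version numbered zero.
    return all((o, 0) in seen for o in ids)
```

For example, a set holding version 0 of objects o and p plus version 2 of o is a database state, while a set that lacks version 0 of p is not.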

Next, transaction steps and transactions are defined.

Definition. A transaction step is a sextuple <i, j, C, P, R, W>, where i and j are
integers, 1 ≤ i ≤ k, and C, P, R, and W are any functions

C: T → {R, W},
P: T → O,
R: D × T → T, and
W: T → D × T.

Here, i is called the transaction number, j is called the sequence number, and C, P,
R, and W are called the conditional function, parameter function, read function, and
write function, respectively. A transaction is a finite sequence of transaction steps,
all with the same transaction number, and with unique, increasing sequence
numbers.

In formalizing the notion of transaction systems, it is convenient to use a variable state
vector <s, t_1, t_2, t_3, ..., t_k>, where s is a database state and each t_i is a transaction state.
Given the current value of the state vector and a transaction step <i, j, C, P, R, W>, a new
value of the state vector is produced as follows. First, if t_i is the halting state, the state
vector is unchanged. Otherwise, C is applied to t_i. The result of C determines which of the
following two cases applies.

Read case: C(t_i) = R. In this case, R is then applied to <d, t_i>, where d is the data
of that version with object ID P(t_i), and with the greatest version number less than
or equal to i. The result of R is a transaction state, which is the new value of t_i in
the state vector; the rest of the state vector is unchanged.

Write case: C(t_i) = W. In this case, W is then applied to t_i, giving a data value d
and a transaction state t. The new value of the state vector is derived by: (1)
setting t_i to t; (2) modifying s, first by removing the version with object ID P(t_i) and
version number i from s if there is such a version, and then by adding
the version <P(t_i), i, d> to s; and (3) leaving the rest of the state vector unchanged.
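
The two cases can be sketched in Python as follows. This is only an illustration of the definitions above: the dictionary representation of the database state, the function name apply_step, and the string "halt" for the halting state are assumptions made here, not part of the original development.

```python
READ, WRITE = "R", "W"   # the read and write symbols
HALT = "halt"            # assumed name for the halting state


def apply_step(s, t, step):
    """Apply one transaction step <i, j, C, P, R, W> to the state vector.

    s maps (object ID, version number) -> data (the database state);
    t maps transaction number -> transaction state.
    Both are updated in place.
    """
    i, _j, C, P, R, W = step
    if t[i] == HALT:
        return                      # a halted transaction changes nothing
    o = P(t[i])                     # object ID chosen from the current state
    if C(t[i]) == READ:
        # Greatest version number of o not exceeding i; version 0 always
        # exists in a database state, so the maximum is well defined.
        v = max(vn for (oid, vn) in s if oid == o and vn <= i)
        t[i] = R(s[(o, v)], t[i])
    else:
        d, new_state = W(t[i])
        s[(o, i)] = d               # replaces version i of o if present
        t[i] = new_state


# Usage: transaction 1 writes y to o, then transaction 2 reads o.
s = {("o", 0): "x"}
t = {1: "a", 2: "a"}
write_1 = (1, 1, lambda st: WRITE, lambda st: "o", None, lambda st: ("y", st))
read_2 = (2, 1, lambda st: READ, lambda st: "o",
          lambda d, st: "b" if d == "y" else "a", None)
apply_step(s, t, write_1)
apply_step(s, t, read_2)
```

After these two steps the database holds versions 0 and 1 of o, and transaction 2 has observed the data written by transaction 1.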

Next,  serial  and  concurrent  transaction  systems  are  defined. 

Definition. A serial transaction system is any sequence of transaction steps
formed by appending k transactions, with transaction numbers 1, 2, 3, ..., k, in this
order. A concurrent transaction system is any sequence of transaction steps
formed by permuting the steps of a serial transaction system subject to the
constraint that the sequence numbers for each transaction remain increasing for
that transaction.

A serial or concurrent transaction system can be applied to an initial value of the state
vector by applying each step of the system in sequence, yielding a final value of the state
vector. A transaction history of this process is a sequence, initially empty, formed by
appending a quadruple for each step in the transaction system, with the exception of steps
for transactions in the halting state. This takes place as follows. Let the current transaction
step be <i, j, C, P, R, W>. If t_i is the halting state, the transaction history is unchanged.
Otherwise:

If C(t_i) = R and P(t_i) = o, append <R, i, j, o> to the history.

If C(t_i) = W and P(t_i) = o, append <W, i, j, o> to the history.

Now, consider the problem of preserving consistency. First, if each transaction individually
preserves consistency, a serial transaction system clearly preserves consistency. Therefore,
if a concurrent transaction system were somehow equivalent to a serial transaction system,
i.e. serializable, it too would preserve consistency. In fact it has been shown in [Kung and
Papadimitriou 79] that this is the weakest such condition for a concurrent transaction system
to preserve consistency if the consistency properties are not known (and of course, they are
not known in an application-independent concurrency control). This leads to the following
definition.

Definition. A concurrent transaction system is serializable (in the order 1, 2, ..., k)
if, when applied to any initial value of the state vector, the final value is identical to
the final value produced by applying the serial transaction system from which the
concurrent transaction system was formed. A transaction history is serializable (in
the order 1, 2, ..., k) if all concurrent transaction systems with this history are
serializable.

This  definition  is  somewhat  different  than  that  usually  appearing  in  the  literature,  in  that  the 
serializability order is assumed to be given. The question of whether this lack of generality
in  the  definition  above  leads  to  any  lack  of  generality  in  the  concurrency  control  is  taken  up 
in  Chapter  4  (the  answer  seems  to  be  that  it  does  not). 

Now,  conflicts  are  defined. 

Definition. Given a transaction history containing <R, i, j, o>, let <W, i', j', o> be
that quadruple in the history with maximal i' and j', subject to i' ≤ i, if such a
quadruple exists. Then transactions i and i' conflict if i' < i and <R, i, j, o> precedes
<W, i', j', o> in the transaction history.
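
The definition translates directly into a test over a recorded history. The following Python sketch assumes a history is a list of tuples (kind, i, j, o) with kind equal to "R" or "W"; the name find_conflicts and this representation are illustrative choices, not taken from the original text.

```python
from typing import List, Set, Tuple

Entry = Tuple[str, int, int, str]   # <R or W, i, j, o>


def find_conflicts(history: List[Entry]) -> Set[Tuple[int, int]]:
    """Return the pairs (i, i') of transaction numbers that conflict."""
    conflicts = set()
    for read_pos, (kind, i, _j, o) in enumerate(history):
        if kind != "R":
            continue
        # Writes to o by transactions numbered at most i, with positions.
        writes = [(wi, wj, pos)
                  for pos, (wk, wi, wj, wo) in enumerate(history)
                  if wk == "W" and wo == o and wi <= i]
        if not writes:
            continue
        # Lexicographic maximum: maximal i', then maximal j'.
        wi, _wj, write_pos = max(writes)
        if wi < i and read_pos < write_pos:
            conflicts.add((i, wi))
    return conflicts
```

A history in which transaction 2 reads o before transaction 1 writes it yields the conflict (2, 1); if the write comes first, or if the writer has the larger transaction number, no conflict is reported.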

The  result  of  this  section  is  the  following  theorem. 

Conflict Theorem. Assume that D has more than one element and that T has
more than one non-halting state. Then a transaction history is serializable if and
only if no two transactions conflict in the transaction history.

Proof. 


(⇐) Given a concurrent transaction system and an initial value of the state vector, note that
(1) the state vector transition produced by a transaction step <i, j, C, P, R, W> depends
on the current value of t_i and, in the read case, that previous transaction step
<i', j', C', P', R', W'> with maximal i', j', i' ≤ i, that was a write to object P(t_i), if such a step
exists; and (2) the only transaction state in the state vector changed by this transaction step
is t_i. Therefore, if there are no conflicts in the transaction history, all state vector transitions
in the concurrent transaction system will be the same as the state vector transitions in the
serial transaction system from which the concurrent transaction system was formed.

(⇒) Given a transaction history with a conflict, a concurrent transaction system and an
initial value of the state vector will be found such that the final value of the state vector is
not the same as the final value produced by the corresponding serial transaction system.
Since there is a conflict, the transaction history is of the form:

... <R, i, j, o> ... <W, i', j', o> ... (i' < i).

Let D = {x, y, ...} and T = {halt, a, b, ...}. O has at least one element (namely o); let O =
{o, p, q, ...}. Let the initial state vector be

<{<o, 0, x>, <p, 0, x>, <q, 0, x>, ...}, a, a, a, ...>,

that is, every object has one version with version number 0 and with data x, and every
transaction is in state a. Let there be one transaction step in the concurrent transaction
system for every quadruple in the transaction history, defining the conditional and parameter
functions of each step so as to agree with the history. Now define every write step in the
concurrent transaction system, except for the one corresponding to <W, i', j', o> above, as a
write with data x and a transition to the current transaction state. Define the remaining write
step as a write with data y and a transition to the current transaction state. Finally, define
every read step as a transition to the current transaction state, except for the read step
corresponding to <R, i, j, o> above; define the read function R of this read step as
R(x, a) = a, R(y, a) = b. This concurrent transaction system is not serializable, since the final
value of t_i is a, but serially the final value is b.

This simple and exact characterization of serializable transaction histories is possible
primarily due to the inclusion of an explicit total ordering of transactions in the definition of
serializability, and to the multi-version definition of objects. When these are omitted the
characterization of serializable histories becomes, by comparison, highly complex - in fact
the problem of determining if a transaction history is serializable in any order, even under a
much simpler data model, has been shown to be NP-complete (see [Papadimitriou 79]).

Transaction histories are useful for concurrency control design since transaction histories
formalize the information available to an application-independent concurrency control. Since
the concurrency control is application-independent, it must not allow any transaction
histories to develop that could have been produced by some non-serializable concurrent
transaction system - this was the motive for the definition of serializable transaction histories
given above. Finally, the conflict theorem provides a simple way to test for non-serializable
transaction histories. For the system described here, the test is actually somewhat simpler
than might be expected from the above, since for each transaction, all reads precede all
writes (to shared objects), and there is at most one write to a shared object. The conflict
theorem will be applied in Chapter 4.

3. System Overview

In  Chapters  4,  5,  and  6,  designs  for  the  concurrency  control  and  global  memory  manager 
subsystems  of  the  four-level  architecture  will  be  developed.  This  chapter  provides  an 
overview  of  a  complete  system  using  this  architecture.  Global  design  decisions,  such  as 
communication  protocols  between  subsystems,  are  discussed.  It  is  assumed  that  the  system 
is  distributed,  so  that  the  memory  manager  has  been  decomposed  into  local  and  global 
memory  managers,  as  described  in  Chapter  2.  Although  this  overall  design  was  developed  in 
the  course  of  the  Cm*  implementation  (Chapter  7),  it  should  apply  to  any  transaction 
processing system using the four-level architecture. The following abbreviations will be
used:  RcdM  -  record  manager,  LMM  -  local  memory  manager,  GMM  -  global  memory 
manager,  CC  -  concurrency  control,  RcvM  -  recovery  manager. 

3.1.  Communication  between  Subsystems 

It is convenient to have communication between LMMs and the GMM, LMMs and the CC,
and between the GMM and the RcvM take place via messages - in this case no other
synchronization will prove necessary. The assumptions here are that there is a message
buffer associated with each process; that sending a message to a process causes the
message to be placed at the end of the message buffer for that process if possible,
otherwise the sending process waits until it is possible to do so, with queueing of waiting
processes; and that receiving a message removes the first message from the message buffer
if there is one, otherwise the receiving process waits until there is a message to remove.
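
These assumptions describe a bounded, blocking FIFO buffer. As a rough sketch, Python's standard queue.Queue behaves this way for a single sender and receiver; the class name MessageBuffer, the capacity of 2, and the message contents below are invented for illustration. One caveat: queue.Queue does not promise that several blocked senders are served in arrival order, so the "queueing of waiting processes" assumption is only approximated.

```python
import queue
import threading


class MessageBuffer:
    """Bounded FIFO buffer owned by one receiving process (illustrative)."""

    def __init__(self, capacity: int):
        self._q = queue.Queue(maxsize=capacity)

    def send(self, message) -> None:
        # Appends at the end; blocks while the buffer is full.
        self._q.put(message)

    def receive(self):
        # Removes the first message; blocks while the buffer is empty.
        return self._q.get()


def demo():
    gmm_buffer = MessageBuffer(capacity=2)
    received = []

    def gmm():
        # Stand-in for the GMM process: consume four messages in order.
        for _ in range(4):
            received.append(gmm_buffer.receive())

    worker = threading.Thread(target=gmm)
    worker.start()
    for n in range(4):                      # stand-in for one LMM sender
        gmm_buffer.send(("read-request", n))
    worker.join()
    return received
```

Because sender and receiver synchronize only through the buffer, no additional locking is needed in this sketch, which mirrors the remark that no other synchronization proves necessary.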

Each RcdM and LMM are part of the same process, and they share a common address
space. Communication between the RcdMs and LMMs can take place by procedure calls.

3.2.  Data  Objects 

At the virtual internal level the database consists of a collection of data objects (when the
context is clear, simply objects), each identified by a unique ID. A data object will be the unit
of data transfer between local memories and shared memory.

For the Cm* system, pages (units of untyped storage of fixed size) were chosen as the data
objects of the virtual internal level primarily for simplicity. There are only three operations
defined on a page - read, write, and delete - in addition to the operation of creating a new
page. A more advanced system could provide more complex objects, such as segments
(units of untyped storage of variable size) and records (units of structured storage -- see
Appendix I). The advantage of only using pages (or segments) is generality: no
commitment is made to any particular data model. On the other hand, providing record
objects at the virtual internal level could be far more efficient, and this approach would
certainly be taken if record access hardware, such as logic-per-track disks, were available.

some type, this will be trivial if the conceptual entities are records of the same type. The
mapping becomes more complex as the difference between the conceptual model and the
virtual internal model grows. In the case that data objects are pages and conceptual entities
are  records,  various  mappings  can  be  achieved  by  organizing  the  database  as  a  directed 
graph  of  pages,  with  certain  root  pages  that  are  never  created  or  deleted  --  access  to  a 
record  takes  place  by  first  accessing  a  root  page,  then  following  pointers  to  a  page  (or 
pages)  containing  the  desired  record.  A  simple  example  of  this  kind  of  mapping  is  to  link  a 
number  of  pages  together  linearly  so  as  to  form  a  sequential  file;  for  a  much  more  complex 
example,  see  Appendix  I. 

3.3. Read and Write Phases

As noted in Chapter 1, in general, any transaction may be aborted. Therefore, in order to
avoid unnecessary transfer of data objects, all writes to the shared database will be buffered
until the end of the transaction. This results in a read phase, in which the transaction is
executed but does not write to the shared database, and then if the transaction is not
aborted, a write phase in which all modified data objects are transferred to the shared
database.


Some other reasons for choosing this approach are (1) it simplifies the GMM design; (2) the
LMM  cannot  determine  if  a  transaction  will  later  modify  an  already  modified  object,  without 
adding  complexity  to  the  RcdM/LMM  interface;  and  (3)  the  object  versions  are  not  known 
until  the  end  of  the  read-phase  (see  the  next  section). 
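
The read-phase/write-phase split can be sketched as a local write buffer that is flushed only on successful completion. The Python class below is a simplified illustration: the name Transaction, the dictionary standing in for the shared database, and the absence of any concurrency control check or version numbering are all assumptions made here.

```python
class Transaction:
    """Buffers writes locally during the read phase (illustrative sketch)."""

    def __init__(self, shared: dict):
        self._shared = shared   # stands in for the shared database
        self._writes = {}       # object ID -> locally modified data

    def read(self, oid):
        # A transaction sees its own buffered modifications first.
        if oid in self._writes:
            return self._writes[oid]
        return self._shared[oid]

    def write(self, oid, data):
        # Read phase: nothing reaches the shared database yet.
        self._writes[oid] = data

    def abort(self):
        # No shared data was touched, so discarding the buffer suffices.
        self._writes.clear()

    def commit(self):
        # Write phase: each modified object is transferred exactly once,
        # however many times it was modified during the read phase.
        self._shared.update(self._writes)
        self._writes.clear()
```

An aborted transaction therefore costs no transfers to shared memory, and an object modified several times is transferred only once, which corresponds to reason (2) above.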

3.4.  Transaction  Numbering  and  Versions 

In Chapter 4, it will be seen that the concurrency control will guarantee that the system
transaction history is always serializable. Furthermore, the serializability order, that is the
transaction numbering, will be made explicit. Thus, the database can be thought of as a
sequence of versions D_0, D_1, D_2, ..., where D_0 is the initial database, and D_i is the database
after sequential execution of the transactions numbered 1, 2, 3, ..., i, in this order. Assuming
transactions are executed sequentially, if each transaction actually wrote a new version D_i of
the entire database, this new version would possibly be inconsistent until the transaction
completed. But if version D_(i-1) were still available, queries that began before the transaction
numbered i had completed could still "see" a consistent database by accessing this older
version. Under concurrency, this approach can be simulated by having each transaction
write new versions only of the objects it modifies. This scheme results in the database
consisting of a collection of objects of one or more versions each, where the version number
of an object is the same as the transaction number of the transaction that wrote the object.
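The versioning scheme just described can be illustrated with a small sketch. The class and method names are hypothetical, and the lookup rule (the newest version whose number does not exceed i) is an assumption drawn from the definition of D_i above, not the GMM algorithm of the later chapters.

```python
class VersionedStore:
    """Toy multi-version object store: object ID -> {version number: value}."""

    def __init__(self, initial):
        # Version 0 plays the role of the initial database D_0.
        self.versions = {oid: {0: value} for oid, value in initial.items()}

    def write(self, txn_number, oid, value):
        # A transaction writes new versions only of the objects it modifies;
        # the version number is the writer's transaction number.
        self.versions.setdefault(oid, {})[txn_number] = value

    def read_as_of(self, oid, i):
        # View of the object in database version D_i: the newest version
        # whose number does not exceed i (an assumed lookup rule).
        candidates = [v for v in self.versions[oid] if v <= i]
        return self.versions[oid][max(candidates)]


store = VersionedStore({"x": 10, "y": 20})
store.write(1, "x", 11)   # transaction 1 modifies only x
store.write(2, "y", 22)   # transaction 2 modifies only y
```

A query that sees D_1 reads x = 11 and y = 20 even after transaction 2 has written its new version of y.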

This multi-version object scheme will be used here. This was the motive for defining
versions of objects in the earlier formal development of serializability. The details of
providing queries with a consistent view of the database without CC support, and of garbage
collecting old versions, are given in Chapter 8. Transactions are sequentially numbered at
successful read-phase completions; the reasons for this are discussed in Chapter 4.

One  objection  sometimes  raised  to  multi-version  object  schemes  is  that  a  large  transaction 
that  modifies  the  entire  database,  for  example  for  reorganization  purposes,  will  create  a  new 
version of every object, thus doubling the needed storage. However, in single-version
schemes, how is recovery supported in this case? That is, what if as a result of hardware
failure, software errors, or human mistakes, the database is "destroyed" by this large
transaction? This is a real possibility, and one would hope that even in single-version object
based  systems  an  earlier  version  of  the  entire  database  were  saved  somewhere  in  this 
eventuality.  In  fact,  this  suggests  the  following  solution  for  multi-version  object  based 
systems:  as  the  large  transaction  runs,  transfer  the  earlier  version  of  each  modified  object  to 
tape  or  other  tertiary  memory,  and  reclaim  the  secondary  memory  space.  Note  that  the 
necessary  mechanism  could  already  be  available  as  an  automatic  archival  subsystem,  or  as 
part  of  the  recovery  subsystem. 

3.5.  Local  Memory  Managers 

The  LMM  will  have  several  responsibilities:  managing  a  local  cache  of  data  objects, 
supporting  the  write-phase,  hiding  the  GMM  and  CC  from  the  RcdM,  and  providing  a  simple 
interface  between  the  CC  and  the  GMM.  In  the  case  that  a  local  disk  is  available,  the  LMM 
could  possibly  participate  with  the  RcdM  in  some  recovery  protocols,  but  this  will  not  be 
considered  here. 

3.5.1.  Cache  Management 

In  order  to  avoid  unnecessary  transfer  of  data  objects,  the  LMM  will  maintain  copies  of  some 
of  the  objects  that  have  previously  been  read  (by  any  local  transaction  or  query).  Every  read 
request to the GMM includes the version number of a local copy, if such exists. No object
transfer is necessary if the local copy is the "correct" version (as determined by the GMM -
see Chapter 5).
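As a rough illustration of this exchange, the sketch below has a local cache send the version number of its copy with each read. `GlobalMemory` and `LocalCache` are hypothetical stand-ins for the GMM and LMM, and "correct" is assumed here to mean simply the current version held by the GMM; the actual rule is the subject of Chapter 5.

```python
class GlobalMemory:
    """Stand-in for the GMM: holds the current (version, value) of each object."""

    def __init__(self, objects):
        self.objects = dict(objects)

    def read(self, oid, local_version=None):
        version, value = self.objects[oid]
        if local_version == version:
            return version, False, None      # local copy is current: no transfer
        return version, True, value          # transfer the object


class LocalCache:
    """Stand-in for the LMM cache of previously read objects."""

    def __init__(self, gmm):
        self.gmm = gmm
        self.copies = {}                     # oid -> (version, value)
        self.transfers = 0                   # counts object transfers from the GMM

    def read(self, oid):
        held = self.copies.get(oid)
        local_version = held[0] if held else None
        version, transferred, value = self.gmm.read(oid, local_version)
        if transferred:
            self.transfers += 1
            self.copies[oid] = (version, value)
        return self.copies[oid][1]
```

Only the version number crosses the interface when the local copy is still current, which is the saving the text describes.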

If  a  particular  transaction  or  query  has  already  read  an  object  and  there  is  still  a  copy,  then 
no GMM communication is necessary at all. Whether or not an object has previously been
read can be determined by marking the local copy.

In the computer network application (Section 2.4.2), in which a local disk is available to the
LMM, at each node in the network that part of the database that is most often used at that
node will migrate to that node. In particular, one would expect at least the upper levels of
the database access structure (directories and indexes) to be present at each node, resulting
in far fewer network object transfers than if caching were not used.

3.5.2. Write-Phase Support

As discussed above, an overall design decision is to write new versions of objects to the
shared database only at the end of the transaction. During
the course of the transaction as seen by the RcdM (i.e. between Tbegin and Tend; see
below), no writes to shared memory occur; instead, a local copy is modified. Then, after the
RcdM has called Tend, if the transaction is successful, the LMM writes all new versions to
shared memory. Thus, the LMM must maintain copies of all objects written, created, or
deleted, in order to perform the write-phase. In the case that there is not enough local
memory for a particular transaction, it is easy to extend the GMM design presented here so
that the LMM can request extra shared memory for its private use. In a centralized system
this latter technique would always be used since the "LMM" would not have any local
memory to manage.

Note:  a  deleted  object  is  treated  here  as  a  new  version  of  an  object,  i.e.  in  a  fashion  identical 
to  a  written  object,  primarily  for  simplicity. 
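A minimal sketch of this buffering follows. All names are hypothetical, and shared memory is simplified to a single (version, value) pair per object rather than a multi-version store; the point shown is only that nothing reaches shared memory before the transaction ends successfully, and that a deletion is buffered like any other new version.

```python
DELETED = object()  # sentinel: a deletion is buffered like any other new version


class BufferedTransaction:
    def __init__(self, shared):
        self.shared = shared      # hypothetical shared memory: oid -> (version, value)
        self.local = {}           # buffered new values, invisible to other transactions

    def read(self, oid):
        if oid in self.local:
            value = self.local[oid]
            if value is DELETED:
                raise KeyError(oid)
            return value
        return self.shared[oid][1]

    def write(self, oid, value):
        self.local[oid] = value   # modifies the local copy only

    def delete(self, oid):
        self.local[oid] = DELETED

    def end(self, validated, txn_number=None):
        # Write phase: runs only for a successful (validated) transaction.
        if validated:
            for oid, value in self.local.items():
                self.shared[oid] = (txn_number, value)
        self.local.clear()
        return validated
```

An aborted transaction simply discards its buffer, so no undo of shared memory is ever needed.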

3.5.3.  Hiding  Versions  and  Concurrency  from  the  RcdM 

Record management problems can be complex, and the complexity could become
unmanageable if the existence of multi-version objects, read-phases and write-phases, and
concurrency control had to be dealt with at the same level. It is the responsibility of the
LMM to hide all of this from the RcdM, thus completing, together with the CC, GMM, and RcvM,
support of a virtual internal model. From the point of view of the RcdM, the database
consists of a collection of objects, of one version each, to which it has exclusive access.

The LMM can hide all of this from the RcdM by mapping RcdM object accesses to local
copies (retrieving a shared copy if necessary), by sending the necessary information to the
GMM to complete or abort a transaction, and by sending the CC the necessary information
to detect conflicts.

3.5.4.  GMM  /  CC  Interface 

When a transaction successfully completes, the CC will return a transaction number, which
is just the current value of the transaction number counter (see Chapter 4). It is the
responsibility of the LMM to supply the GMM with this number for version number use during
the write-phase.

3.6. Summary of Subsystems and Interfaces

The following is a summary of the functions and interfaces of the various subsystems.

3.6.1. RECORD MANAGER

Functions. This subsystem defines the conceptual model and its mapping to the data
objects of the virtual internal level. This includes the functions of definition of record
structures; mapping of records, relations, etc., onto objects; creation and maintenance of

3.6.2. LOCAL MEMORY MANAGER

Functions. Cache management; buffering until the end of a transaction objects written,
created, or deleted; doing this in such a fashion that the facts that there are multiple
versions of objects and that objects are shared are invisible to the record manager; providing a
small, more controlled interface between the RcdM, GMM, and CC.

Interface. 

Qbegin - begin a query. Invokes MQbegin.

Qread - make an object addressable. May invoke Mread, may cache object in local
memory.

Qrelease - make an object non-addressable (free local memory for object).

Qend - end a query. Invokes MQend.

Tbegin - begin a transaction. Invokes Cbegin and MTbegin.

Tread - make an object read addressable. May invoke Cread or Mread, may cache
object in local memory.

Tcreate - create an object. Invokes Mcreate, allocates local memory.

Twrite - make an object read/write addressable. May invoke Mread, Cwrite or Mnew,
allocates local memory if necessary.

Trelease - make an object non-addressable (may free local memory for object).

Tdelete - delete an object. Invokes Mnew, allocates local memory if necessary.

Tabort - abort a transaction. Invokes Cabort and Mabort.

Tend - complete read-phase of a transaction. Invokes Cvalid; then if successful invokes Mwrite,
writes new versions to shared memory; finally MTend and Cend; otherwise invokes
Mabort.

Tname - generate a unique name. Invokes Mname.

3.6.3.  CONCURRENCY  CONTROL 

Functions. Detect and resolve possible conflicts so as to guarantee serializability;
transaction numbering. Conflicts are detected by keeping track, for each transaction, of sets
of objects (IDs) read and written; conflicts are resolved by having transactions wait or
aborting transactions.

Interface. 

Cbegin - begin a transaction.

Cabort - abort a transaction.

Cread, Cwrite - request for access to an object. Check for conflicts - return decision or
have transaction wait.

Cvalid - indicates end of read-phase for transaction. Final conflict check - return
decision or have transaction wait.

Cend - indicates end of write-phase.

3.6.4. GLOBAL MEMORY MANAGER

Functions. Maintaining object ID, version => physical address mappings; supplying each
query with a consistent "snapshot" of the database; creation of new objects; garbage
collection of old and deleted objects; generation of unique names.

Interface.

MTbegin, MQbegin - begin a transaction, query.

MTend, MQend - end a transaction, query.

Mabort - abort a transaction.

Mread - read an object. Find correct version and return version number and address.

Mnew - allocate space for new version of an object (including "deleted version").

Mcreate - create a new object (allocate space and return ID).

Mwrite - write an object (including "deleted version"); return address of shared memory.

Mname - generate and return a unique name.

3.6.5. RECOVERY MANAGER

Functions. During normal operation: write each new version (say), new values of the
write-phase completion counter (see Chapter 9), other information useful for recovery from
failures on highly reliable, inexpensive, write-once media. During recovery, find old versions
in such a fashion so as to allow recovery of a previous consistent state.

Note. It should often be possible for the GMM to recover from non-disk system failures. In
general, this GMM-only recovery will be possible if the failure did not cause garbage to be
written on the disk in "sensitive areas". Garbage can often be detected by redundancy
techniques, e.g. checksums. In the simplest case, the RcvM could be used periodically to
back up the entire contents of system disks, with no transactions allowed during this period.
For example, in the Cm* system described in Chapter 7, a very simple RcvM was written as
part of the GMM that saved the GMM object ID, version => address mapping on request.
The database could be backed up if desired by file transfer to another machine. A
somewhat more complex use would be to take periodic "snapshots" of the database, [...]
considering this entire procedure as one very large query; transactions would still be
allowed in this case. It is also possible to do this in a much more dynamic way, and to allow
[...] the write-phase completion list or the write-phase completion counter (see Chapter 9).
The cost of this latter approach is an open problem, but it would perhaps prove useful when [...]
4. A General Paradigm for Concurrency Controls

In this chapter a general design paradigm for concurrency controls will be presented. First,
it  is  necessary  to  discuss  in  more  detail  the  assumptions  made  regarding  the  fashion  in 
which  the  CC  controls  transactions. 

4.1.  Controlling  Transactions 

As a transaction runs, on the first Tread for a given object, the LMM will send the CC a
Cread message for the object, and wait for a reply. This message is interpreted as a request
for read access to the object. The CC must at this point decide whether to grant the
transaction read access immediately, in which case a positive reply is sent; to abort the
transaction, in which case a negative reply is sent; or to postpone the decision, in which
case a reply, perhaps positive, perhaps negative, will be sent at some later time.

Similarly, on the first Twrite for a given object, the LMM will send the CC a Cwrite message
for the object, and wait for a reply. This Cwrite message is interpreted as a request for
read/write access to the object. Again, the CC replies with its decision, or postpones its
decision.

Although  the  CC  can  abort  a  transaction  by  replying  negatively  to  a  request,  it  may  also  be 
the  case  that  the  CC  will  decide  to  abort  a  transaction  at  a  point  where  the  transaction  is  not 
waiting  for  a  message  from  the  CC.  For  example,  the  CC  may  decide  to  abort  one 
transaction  based  on  a  request  from  another  transaction.  If  it  is  possible  to  interrupt  the 
transaction,  the  abort  may  be  handled  in  this  fashion.  Timing  problems  can  be  handled  by 
requiring the LMM to send an acknowledgement message to the CC, and by the CC marking
the transaction as aborted, ignoring all requests from that transaction until the
acknowledgement is received. Alternatively, the CC can mark the transaction as aborted,
and send a negative reply on the next request from the transaction (for simplicity, the CC
algorithms of Appendix III use this technique). In any case, the transaction is said to be
aborted whenever the CC either replies negatively, interrupts the transaction, or marks the
transaction as aborted, whichever occurs first.

If a transaction already has read or read/write access to some object, a request for the
same type of access to the same object is handled by immediately returning a positive reply
(unless the transaction is aborted).

When a transaction "ends" (ends from the point of view of the RcdM), the LMM will send the
CC a Cvalid message, and wait for a reply. Prior to this message, a transaction that has not
been aborted is said to be active. This message is in essence a request to the CC for final
approval, or validation, of the transaction. As discussed in Chapter 3, an overall design
decision is to send the GMM new versions only if the transaction that generated the new
versions can be guaranteed not to be aborted. Since the LMM will begin its write phase at
this point, a positive reply must be the CC's final decision on the transaction. If
the CC does return a positive reply, the transaction is said to be validated. Included with the
reply is a transaction number, which the LMM will use as a version number for all objects
written in the write-phase, and which the GMM will eventually use in WPCL processing (see
Chapter 6). Aborted transactions are not numbered. A transaction number is generated
simply by returning the value of an integer transaction number counter, for brevity named
TNC, and then incrementing the counter. Let TNC be initially one.

At  any  point  in  time  there  will  be  a  validated  transaction  history  consisting  of  all  reads  and 
writes of validated transactions with transaction numbers 1, 2, 3, ..., TNC - 1. The function of
the CC is to control transactions so as to guarantee that the validated transaction history is
always serializable. By the conflict theorem (see Section 2.5), this means that the validated
transaction history must be kept conflict-free. There are two methods that can be used for
guaranteeing this. First, transactions can be aborted - an aborted transaction will not be
part of the validated transaction history. For example, let a be a transaction in the validated
transaction history: a correct concurrency control will in the future abort any (not yet
validated) transaction b that conflicts with a (even though b has not yet been assigned a
transaction number, it is known that if it ever is numbered, the transaction number of a will
be less than the transaction number of b, so it makes sense to speak of a conflict between a
and b). The reason for this is that if b conflicts with a, one of the two must be aborted;
however, a has already been given final approval, so b must be aborted. Second, in an
attempt  to  avoid  aborting  transactions,  transactions  can  be  made  to  wait  for  read  or 
read/write  access.  The  idea  is  to  rearrange  reads  and  writes  by  postponing  some  of  them 
so  that  the  validated  transaction  history  is  kept  conflict-free. 

Suppose now that a Cvalid message is received from some transaction (that is not aborted).
This transaction does not conflict with any previously validated transaction (or it would be
aborted). How can the CC handle this request? As far as consistency of the database is
concerned, the transaction can be validated immediately. Another possibility is for the CC to
postpone  the  decision.  At  first  this  might  seem  pointless:  if  it  is  possible  to  validate  the 
transaction,  why  not  do  so  immediately?  The  reason  is  that  cases  may  arise  in  which  a 
transaction,  if  validated,  causes  a  conflict  with  an  active  transaction,  which  means  that  the 
active  transaction  must  then  be  aborted.  However,  if  the  validation  of  the  transaction  were 
postponed,  it  might  be  possible  to  eventually  validate  both  transactions.  A  transaction  for 
which  validation  has  been  postponed  is  said  to  be  pending. 

Finally,  after  a  validated  transaction  completes  its  write-phase,  the  LMM  will  send  the  CC  a 
Cend  message.  Such  a  transaction  is  then  said  to  be  completed.  No  decision  is  necessary 
at  this  point  for  the  transaction  that  sent  the  message,  since  the  final  decision  was  made 
earlier  when  the  transaction  was  validated  -  rather,  the  purpose  of  this  message  is  to  inform 
the CC that the new versions written by the transaction may now be read by other
transactions.  In  summary,  the  state  transitions  of  a  transaction  are  shown  in  Figure  4.1. 
Figure 4.1. Transaction State Transitions
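The states named in this section (active, pending, validated, completed, aborted) can be sketched as a small state machine. The transition table below is read off the prose of Section 4.1 and is an assumption to that extent, since the figure itself is not reproduced here; the class and attribute names are hypothetical. The sketch also numbers a transaction on validation by returning the current TNC value and then incrementing it.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    PENDING = "pending"
    VALIDATED = "validated"
    COMPLETED = "completed"
    ABORTED = "aborted"


# Allowed moves, as read from the prose description of Section 4.1.
ALLOWED = {
    State.ACTIVE: {State.PENDING, State.VALIDATED, State.ABORTED},
    State.PENDING: {State.VALIDATED, State.ABORTED},
    State.VALIDATED: {State.COMPLETED},   # validation is final approval
    State.COMPLETED: set(),
    State.ABORTED: set(),
}


class Transaction:
    tnc = 1  # transaction number counter (TNC), initially one

    def __init__(self):
        self.state = State.ACTIVE
        self.number = None                # aborted transactions are never numbered

    def move(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        if new_state is State.VALIDATED:
            # Return the current TNC value, then increment the counter.
            self.number = Transaction.tnc
            Transaction.tnc += 1
        self.state = new_state
```

With this numbering, the validated transactions at any time carry the numbers 1 through TNC - 1.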


4.2.  Correctness  of  the  Concurrency  Control 

In  Section  2.5,  conflicts  were  defined  in  terms  of  transaction  histories,  and  transaction 
histories were defined as a sequence of quadruples of the form <R, i, j, o> or <W, i, j, o>,
where i is a transaction number, j is a sequence number, and o is an object ID. However,
this formalism is unsuitable for the present purposes, since although the CC has information
about the reads and writes of transactions, it has no information about their exact
interleaving. For example, with respect to writes, the CC "knows" only that for any
transaction, all writes take place after validation and before completion. Also, the
transaction number of a transaction is not known until validation. In order to define a
correct CC, a formalism describing the actions of the CC is necessary. With this in mind, a
CC history is defined as a sequence of tuples of the following forms, where a is a transaction
ID and p is an object ID:

<R,  a,  p>,  meaning  a  is  granted  read  access  to  p; 

<W,  a,  p>,  meaning  a  is  granted  write  access  to  p; 

<V, a>, meaning a is validated;

<A,  a>,  meaning  a  is  aborted; 

<C,  a>,  meaning  the  completion  message  from  a  is  received. 

The following predicates will be useful.

R(a,  p):  <R,  a,  p>  appears  in  the  CC  history; 

W(a,  p):  <W,  a,  p>  appears  in  the  CC  history; 


V(a): <V, a> appears in the CC history.

The CC history of an executing CC is maintained by appending the appropriate tuple as each
of the above actions is taken by the CC. This is straightforward except for <W, a, p>: when
a transaction requests read/write access via Cwrite and a positive reply is returned,
if R(a, p), then only <W, a, p> is appended; otherwise, <R, a, p> and <W, a, p> are both
appended. With this notation, a correct CC history can be defined as follows.

Correctness. The CC history is correct if for all pairs of transaction IDs a and b in
the history such that R(a, p) and W(b, p), one of the following cases holds:

C1. Not V(a) or not V(b);

C2. V(a) and V(b), and <V, a> precedes <V, b>;

C3. V(a) and V(b), <V, b> precedes <V, a>, and <C, b> precedes <R, a, p>.

A correct CC is a CC for which the CC history is kept always correct. This correctness
criterion is a straightforward application of the conflict theorem. In case C1, there is
currently no conflict between a and b, since at least one of them has no writes to the current
point. Next, if the transaction number of a is less than the transaction number of b, no
conflict is possible between a and b with respect to a read of a and a write of b if all reads
of a take place before any writes of b, as is the case in C2. Finally, if the transaction
number of b is less than that of a, case C3 requires that a not be granted read access to the
object in question until it can be guaranteed that b has written the new version of the object.
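The criterion can be checked mechanically on a recorded history. The sketch below is illustrative only: it represents tuples as Python tuples such as `("R", "a", "p")` or `("V", "a")`, assumes each tuple appears at most once, and assumes the pairs a, b are distinct transactions. The function name `is_correct` is hypothetical.

```python
def is_correct(history):
    """Check a CC history against cases C1, C2, and C3."""
    # Position of each tuple in the history ("precedes" = smaller index).
    index = {}
    for i, event in enumerate(history):
        index.setdefault(event, i)

    reads = [e for e in history if e[0] == "R"]
    writes = [e for e in history if e[0] == "W"]
    for (_, a, p) in reads:
        for (_, b, q) in writes:
            if p != q or a == b:
                continue                       # not a pair R(a, p), W(b, p)
            va = index.get(("V", a))
            vb = index.get(("V", b))
            if va is None or vb is None:
                continue                       # C1: not V(a) or not V(b)
            if va < vb:
                continue                       # C2: <V, a> precedes <V, b>
            cb = index.get(("C", b))
            if cb is not None and cb < index[("R", a, p)]:
                continue                       # C3: <C, b> precedes <R, a, p>
            return False                       # none of C1, C2, C3 holds
    return True
```

For instance, the history `W(b,p), R(a,p), V(b), V(a), C(b)` is rejected, since b is validated first but a read p before b's completion message arrived.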

The correctness criterion is violated only in the case that for some transactions a and b,
R(a, p), W(b, p), V(a), V(b), <V, b> precedes <V, a>, and <R, a, p> precedes <C, b> (see
Figure 4.2). In such a case a conflict is possible in the validated transaction history:
transaction a may have read the most recent version of the object with ID p before the new
version created by transaction b had been transferred. Thus, the validated transaction
history is serializable if and only if it is conflict-free, and it can be guaranteed to be
conflict-free if and only if the CC history satisfies the above correctness criterion.

4.3.  The  Paradigm 

The correctness criterion defines correct CC histories in a static way: given a CC history, it
can be determined if the CC history is correct. The problem now is to design the CC so that
the CC history is kept always correct.

The independence of the CC from other modules and from applications means that the CC
can make no predictions about the future accesses of transactions. In fact, as explained in
Chapter 1, for most cases such predictions are impossible. Therefore, conditions
leading to a violation of the correctness criterion must be detected dynamically based on
incoming access requests. In the absence of such conditions, access requests will always
be granted. That is, the CC will abort transactions or have them wait based only on current
conditions that could possibly lead to violations of the correctness criterion.


Figure 4.2. Only Violation of Correctness Criterion (events <W, b, p>, <R, a, p>, <V, b>, <V, a>, <C, b>; arrows denote "precedes in time")

The only condition that could possibly lead to a violation of the correctness criterion is that
R(a, p) and W(b, p) for some transactions a and b. This condition is called a possible
conflict, and can be detected by maintaining for each object ID p a read set, the set of
transactions a for which R(a, p), and a write set, the set of transactions a for which W(a, p).
Empty read or write sets need not be maintained; also, these sets need not be maintained for
aborted transactions (due to C1). It will also be seen that they need not be maintained for
completed transactions (this is due to the fact that transactions are numbered in validation
order - see Section 4.5 below).
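A minimal sketch of this bookkeeping follows. The class `AccessSets` and its methods are hypothetical names; each grant returns the transactions with which a possible conflict has just been detected, and `forget` drops a transaction (aborted or completed) and discards any set that becomes empty.

```python
class AccessSets:
    """Per-object read and write sets used to detect possible conflicts."""

    def __init__(self):
        self.read_set = {}    # object ID p -> set of transactions a with R(a, p)
        self.write_set = {}   # object ID p -> set of transactions a with W(a, p)

    def grant_read(self, a, p):
        self.read_set.setdefault(p, set()).add(a)
        # Possible conflicts: every other transaction b with W(b, p).
        return {b for b in self.write_set.get(p, ()) if b != a}

    def grant_write(self, a, p):
        self.write_set.setdefault(p, set()).add(a)
        # Possible conflicts: every other transaction b with R(b, p).
        return {b for b in self.read_set.get(p, ()) if b != a}

    def forget(self, a):
        # Sets need not be kept for aborted or completed transactions,
        # and empty sets need not be kept at all.
        for sets in (self.read_set, self.write_set):
            for p in list(sets):
                sets[p].discard(a)
                if not sets[p]:
                    del sets[p]
```

Detection is then a lookup at grant time rather than a later intersection of whole sets.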

The  paradigm  will  be  described  by  listing  various  options  for  handling  access  and  validation 
requests. When R(a, p) and W(b, p) is detected for some transactions a and b, both active
or pending, it is assumed that the CC records this fact for later reference. The alternative is
to possibly (depending on the options selected) later check various read and write sets for
intersection, which may become excessively time-consuming for a large number of
transactions or for large read and write sets. When R(a, p) and W(b, p), a and b active or
pending, this relation between a and b is written as a -> b.

The meaning of the relation a -> b is that in order to validate both a and b, a must be
validated before b (this is from C2 - note that C3 does not apply since b is not completed).
Depending on the options selected, it may arise that a -> b and b -> a, in which case only
one of the two transactions can be validated. Similarly, if a -> b, b -> c, and c -> a, only
two of these three transactions can be validated. In an attempt to validate both a and b
when a -> b but not b -> a, and also in an attempt to avoid cyclic -> conditions such
as a -> b and b -> a, an access or validation request may be postponed until one or more
events have occurred. This is called scheduling. When the event or events have occurred,
the access or validation request is re-analyzed as if newly arrived. It is assumed that the
goal of scheduling is to avoid aborting transactions. Therefore, from the correctness
criterion, there are two types of events that can be used in scheduling: the validation of a
transaction (C2), and the completion of a transaction (C3). The following notation will be
used for scheduling.

a =>v b: a positive reply to the current access or validation request of transaction
b will not be sent until transaction a is aborted or validated (transaction
b may be aborted before then);

a =>c b: a positive reply to the current access request of transaction b will not be
sent until transaction a is aborted or completed (transaction b may be
aborted before then).
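One way a CC might recognize a cyclic -> condition before it arises is sketched below: recording a -> b closes a cycle exactly when a is already reachable from b along recorded -> edges. The representation of the relation as a dictionary of successor sets and both function names are assumptions made for illustration; the paradigm itself leaves the choice of response (postpone, abort, or grant) open.

```python
def reachable(edges, start, goal):
    """Depth-first search: is goal reachable from start along -> edges?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False


def would_close_cycle(edges, a, b):
    """Would recording a -> b create a cyclic -> condition?"""
    return reachable(edges, b, a)


# Recorded relation: a -> b and b -> c.
edges = {"a": {"b"}, "b": {"c"}}
```

Here recording c -> a would close the three-transaction cycle described above, so at most two of a, b, and c could then be validated.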

When  a  transaction  requests  access  to  an  object  p,  it  may  be  desirable  for  the  CC  to  base 
its  decision  on  the  processing  of  this  request  not  only  on  those  transactions  a  for  which  R(a, 
p)  or  W(a,  p),  but  also  on  those  transactions  for  which  access  to  p  has  been  postponed 
(e.g.,  in  order  to  queue  requests).  If  a  read  or  write  request  from  transaction  a  for  object  p 
has been postponed, this is written as RP(a, p) or WP(a, p), respectively.

The  paradigm  follows.  The  generality  of  the  paradigm  is  considered  in  Section  4.5. 


Read request. Transaction a requests read access to p, no current read access.

R1 (aborted and completed transactions). Any aborted transaction can
be ignored by C1, and any completed transaction can be ignored by C1 (if
a is never validated) or C3 (if a is later validated).

R2 (postpone). For each active, pending, or validated transaction b with
W(b, p) or WP(b, p) (WP(b, p) is possible only for b active), do one of the
following.

R2.1 (skip). Skip b in this step.

R2.2 (abort). Abort b (applicable only if b is not validated).

R2.3 ( =>c ). Schedule b =>c a (thereby avoiding aborting a,
aborting b, or a -> b - see R3 and R4 below).

If R2.3 was selected for any b, the access request has been postponed, and
the CC history remains correct by C1. Later, the access request will be
re-processed; for now, terminate.


R3 (abort). It is assumed now that the access request will not be
postponed. If there is any validated transaction b with W(b, p), neither C2
nor C3 can ever be true for a and b if the access request is now granted,
so a could never be validated under the correctness criterion. If a is
granted read access now, it may read inconsistent data, since the new
version of p written by b may or may not be read by a. Therefore, if there is
any validated transaction b with W(b, p), since a can never be validated, a
should be aborted: abort a and terminate. Otherwise, if there are any active
or pending transactions with W(b, p) or WP(b, p), optionally abort a, and
terminate.

R4 (grant). For each active or pending transaction b with W(b, p),
record a -> b. Then, grant a read access, and terminate.
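Steps R1 through R4 leave several choices open. The sketch below fixes one hypothetical policy purely for illustration: wait (R2.3) for any validated but not yet completed writer, skip (R2.1) active and pending writers, never use the optional abort of R3, and record a -> b on granting (R4). Postponed requests (WP) are not modeled, and the function name, the `state` mapping, and the return format are all assumptions.

```python
def handle_read(a, p, state, write_set, precedes):
    """Process a read request of transaction a for object p.

    state:     txn -> 'active' | 'pending' | 'validated' | 'completed' | 'aborted'
    write_set: object ID -> set of transactions b with W(b, p)
    precedes:  set of recorded (x, y) pairs meaning x -> y
    """
    # R1: aborted and completed transactions are ignored.
    writers = sorted(
        b for b in write_set.get(p, ())
        if b != a and state[b] in ("active", "pending", "validated")
    )

    # R2.3: schedule b =>c a for every validated writer b; others are skipped.
    wait_for = {b for b in writers if state[b] == "validated"}
    if wait_for:
        return ("postponed", wait_for)   # re-processed after b completes

    # R3: no validated writer remains; this policy never aborts a here.
    # R4: record a -> b for each active or pending writer, then grant.
    for b in writers:
        precedes.add((a, b))
    return ("granted", set())
```

Because every validated writer is waited for in R2.3, the mandatory abort in R3 never applies under this particular policy.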

Write  request.  Transaction  a  requests  write  access  to  p,  no  current  write  access. 

W1  (aborted  and  completed  transactions).  Any  aborted  transaction  can 
be ignored by C1, and any validated or completed transaction can be
ignored by C1 (if a is never validated) or C2 (if a is later validated).

W2  (postpone).  For  each  active  or  pending  transaction  b  with  R(b,  p)  or 
RP(b,  p)  {RPib,  p)  is  possible  only  for  b  active),  do  one  of  the  following. 

W2.1  (skip).  Skip  b  in  this  step. 

W2.2  (abort).  Abort  b. 

W2.3 (->v). Schedule b ->v a -- in this option, a waits for the
validation of b at the current point, rather than possibly waiting at the
validation point in V1.3 below.

If  W2.3  was  selected  for  any  b,  the  access  request  has  been  postponed, 
and the CC history remains correct by C1. Later, the access request will be
re-processed;  for  now,  terminate. 

W3  (abort).  If  there  are  any  active  or  pending  transactions  b  for  which 
R(b,  p)  or  RP(b,  p),  optionally  abort  a,  and  terminate. 

W4  (grant).  For  each  active  or  pending  transaction  b  with  R(b,  p), 
record b -> a. Then, grant a write access, and terminate.
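
Steps W2 through W4 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the names Choice, choose, abort_requester, and the dictionary-of-sets state are hypothetical, and W1 is assumed to have already removed aborted, validated, and completed transactions from the inputs.

```python
from enum import Enum, auto


class Choice(Enum):
    SKIP = auto()   # W2.1
    ABORT = auto()  # W2.2
    WAIT = auto()   # W2.3: schedule b ->v a


def write_request(a, readers, pending_readers, choose, abort_requester, state):
    """Process a write request from a for some object p.

    readers:          active or pending b with R(b, p)
    pending_readers:  active b with RP(b, p)
    choose(a, b):     hypothetical policy callback returning a Choice (W2)
    abort_requester:  hypothetical policy callback for the optional abort (W3)
    state:            dict of sets "aborted", "waits_v", "follows" (assumed layout)
    """
    survivors = []
    postponed = False
    for b in list(readers) + list(pending_readers):   # W2: one option per b
        choice = choose(a, b)
        if choice is Choice.ABORT:
            state["aborted"].add(b)
            continue
        survivors.append(b)
        if choice is Choice.WAIT:
            state["waits_v"].add((b, a))              # b ->v a
            postponed = True
    if postponed:
        return "postponed"                            # re-processed later
    if survivors and abort_requester(a, survivors):   # W3 (optional)
        state["aborted"].add(a)
        return "aborted"
    for b in survivors:                               # W4
        if b in readers:
            state["follows"].add((b, a))              # record b -> a
    return "granted"
```

Whatever the two callbacks return, the function only ever takes one of the options the paradigm allows, which is the sense in which policy is separated from correctness.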

Note that the write paradigm can be obtained from the read paradigm by interchanging R
and W and ->c and ->v, by reversing ->, by replacing "completed" with "validated or
completed" in R1, and by removing the now inapplicable statement that a be aborted if there
are any validated conflicting transactions in R3.

Read/write request. Transaction a requests read/write access to p, no current access.

In the processing of this request the set of possibly conflicting transactions
is all those transactions b with W(b, p), WP(b, p), R(b, p), or RP(b, p).

Again, any of these may optionally be aborted; a may optionally be
postponed; a may optionally be aborted; or a may be granted read/write
access. In the case that a is postponed, ->c scheduling must be used for
any transaction b with W(b, p) or WP(b, p), and ->v scheduling for any
transaction b with R(b, p) or RP(b, p) (there is no harm in using both types
of scheduling with respect to the same transaction -- in such a case ->v is
simply superfluous). As in a request for read access, if a is not postponed
and there is a validated transaction b with W(b, p), a should be aborted.

Finally, if a is granted read/write access, a -> b must be recorded for each
transaction b with W(b, p), and b -> a must be recorded for each transaction
b with R(b, p) (both might be recorded with respect to the same transaction).

Validation  request.  Transaction  a  requests  validation. 

V1 (postpone). For each active or pending transaction b with b -> a, do
one of the following.

V1.1 (skip). Skip b in this step.

V1.2 (abort). Abort b.

V1.3 (->v). Schedule b ->v a (thereby possibly avoiding aborting
a or b).

If V1.3 was selected for any b, the validation request has been postponed, a
is now pending, and the CC history remains correct by C1. Later, the
validation request will be re-processed; for now, terminate.

V2 (abort). If there are any active or pending transactions b with b -> a,
optionally abort a, and terminate.

V3 (validate). For each active or pending transaction b with b -> a, abort
b. Then validate a, and terminate.

This  completes  the  description  of  the  CC  paradigm.  In  the  next  section  the  question  of  how 
best  to  use  the  paradigm  is  considered. 

4.4.  Policies 

The correctness criterion gives only those necessary and sufficient conditions for the
serializability of the validated transaction history -- it does not, for example, rule out the case
in which a transaction reads inconsistent data (see R3 above), although it does rule out the
case in which such a transaction is ever validated. Nor does it rule out the case in which a
transaction is never validated or aborted. In designing the paradigm for processing access
[...] the CC, and the assumption that the purpose of scheduling is to avoid aborting transactions.
No other criteria, such as "fairness", or the guaranteed eventual successful completion of
transactions, were used. But in practice, first, it is necessary to choose one of the options
provided by the paradigm; second, additional correctness properties such as guaranteed
eventual successful completion may be important; and finally, it is desirable to choose
[...] a policy, which is that part of the CC design that chooses the options as provided by the
above paradigm.

Some additional correctness properties that may be important have to do with the two
mechanisms that are used to control transactions: scheduling and aborting. In the case of
scheduling, two well-known problems are deadlock (in which the union of the ->v and ->c
relations is not a partial ordering) and starvation (in which a transaction is repeatedly
scheduled so that it waits potentially forever). Assuming aborted transactions are
automatically restarted until they complete successfully (this may or may not be the case in
an application), two analogous problems for aborting are cyclic restart (in which a finite set of
transactions repeatedly cause each other to be aborted so that none of the transactions is
ever validated) and infinite restart (in which a transaction is repeatedly aborted due to
possible conflict with a potentially infinite set of transactions). Note: there does not appear
to be widespread agreement in the literature on terminology for these latter two problems.
Whether or not any of these conditions are problems depends both on the policy and the
application: if the policy never chooses a scheduling option, clearly deadlock and starvation
are not problems. On the other hand, perhaps deadlock and starvation are possible, but in
the application it is acceptable either to assume that (in the case that transactions are
interactively generated) impatient users will abort their transactions, or to assume that the
LMM will abort transactions on timeouts. This latter mechanism, which would often be used
in a distributed environment, could perhaps make deadlock detection unnecessary. If
deadlock detection is necessary, then there may be a policy question of how often to check
for possible deadlock (see [Gray 78]).

All  known  solutions  to  the  problems  of  cyclic  and  infinite  restart  involve  some  kind  of  priority 
scheme. The general idea is, first, to design the policy so that transactions with sufficiently
high priority will never be aborted, and second, to give a transaction increasing priority as it
becomes older or is repeatedly aborted. Of course, priority schemes can be based on
performance criteria as well. Some of the many possible priority schemes are: (1) give
increasing priority to transactions as they are aborted; (2) give increasing priority to
transactions as their original starting time (that is, a starting time not changed by repeating a
transaction due to a failure) becomes older -- these are the timestamp-based approaches
(see  the  following  section);  (3)  give  priority  to  transactions  that  are  generated  interactively; 
(4)  give  priority  to  transactions  that  are  part  of  some  real-time  process;  (5)  give  priority  to 
transactions  that  are  for  some  reason  expensive  to  retry  (e.g.,  "big"  transactions).  Various 
priority  schemes  can  also  be  combined. 

The use of any priority scheme would involve extensions to the CC interface as presented in
Appendix II, e.g., inclusion of a unique transaction ID, a starting time, or a transaction class
as a parameter of Cbegin. Here, various basic policies that do not use any priority scheme
[...] required by the policy as additional parameters to Cbegin.

Policies can be defined statically at design-time, or dynamically as policy modules. In the
case that policies are defined statically various optimizations can be made -- for example, in
some policies deadlock is impossible, and so deadlock detection may be omitted. On the
other hand, the policy module approach may offer efficiency advantages in the case that the
policy module is designed so as to be able to change policies at run-time. In the case that
several policies are of interest, but their differences cannot be predicted, it becomes easy to
experiment with these policies by using a policy module that provides all such policies. An
example  is  the  policy  module  used  in  the  Cm*  system  (see  Chapter  7),  which  provided  330 
distinct  basic  policies  (see  Chapter  5). 

4.5.  Generality  of  the  Paradigm 

A  great  deal  of  previous  research  in  concurrency  control  design  can  be  viewed  as  policy 
design. It seems that some confusion has often resulted due to the lack of a clear
separation of fundamental correctness criteria (e.g., the correctness criterion above),
additional correctness criteria (e.g., guaranteed eventual successful completion), and policy
criteria (e.g., giving priority to transactions that are expensive to retry). Although the design
problems may be listed independently, they may not be handled independently in the design
itself. This has had the effect of making essentially similar concurrency controls seem
superficially quite different. In fact, in a recent extensive survey of proposed concurrency
controls [Bernstein and Goodman 80], it is concluded that "all practical concurrency control
methods can be analyzed as combinations and variations of two basic synchronization
techniques: two-phase locking and timestamp ordering."

Two-phase locking (due to [Eswaran et al 76]) is obtained from the above paradigm by
selecting options R2.3 and W2.3 (or alternatively, V1.3 in place of W2.3) whenever
possible (i.e., in the absence of deadlock). There are many variations of timestamp ordering
concurrency controls, and its true nature is often obscured by combining the technique with
two-phase locking types of policies. In what might be called "pure" timestamp ordering (no
two-phase locking component), options R2.2 or R3 and W2.2 or W3 are always selected,
with R2.2 or W2.2 selected if (referring to the paradigm) a has an earlier original starting
time than b, and with R3 or W3 selected otherwise. Although the above claim of [Bernstein
and Goodman 80] cannot be agreed with here (timestamp ordering seems more properly a
policy priority scheme, and concurrency controls involving a -> b options R4.1 and W4.1
are neglected), they do clearly point out that almost all proposed concurrency controls are
[...] from the basic correctness problem of guaranteeing serializability. As an example of a
complex policy problem, two-phase locking assumes that aborting transactions is expensive,
and tries to avoid this whenever possible, but at the expense of scheduling. Yet, given the
LMM description of Section 3.5, a transaction that will later be aborted is still possibly doing
useful work in caching copies of objects. Such a transaction, when restarted, could quickly
run to completion if very few object transfers were then necessary. So whether aborting or
scheduling is more expensive is not obvious, and is most likely highly application dependent.
A truly general concurrency control should provide for all kinds of policies, including those
that select a -> b options, such as [Kung and Robinson 81] and [Stearns and Rosenkrantz
81]. The advantages of separating basic correctness and policy, in system correctness and
maintainability, have been discussed in a more general context in [Everhart 79].

The  paradigm  above  is  completely  general,  given  the  following  conditions. 

1.  Access  requests  cannot  be  predicted  in  advance. 

2.  The  goal  of  scheduling  is  to  postpone  transactions  for  as  short  a  time  as  possible 
in  order  to  attempt  to  prevent  aborts. 

3.  A  transaction,  once  validated,  cannot  be  aborted,  and  the  validated  transaction 
history  must  be  kept  serializable  in  validation  order. 

Condition  (1)  is  common  to  all  application-independent  concurrency  controls.  Condition  (2) 
excludes highly heuristic scheduling techniques such as "wait 10 seconds and retry." Such
techniques may be valuable in distributed systems, but are beyond the scope of this work.
Condition (3) seems to be common to all practical concurrency controls. The notion of
validating  a  transaction,  or  giving  it  final  approval,  is  also  often  called  committing  a 
transaction  (e.g.,  see  [Gray  78]  -  but  also  see  the  note  below).  This  seems  to  be  a 
necessary  simplification  to  make  the  problem  of  concurrency  control  manageable.  In  fact,  it 
is  hard  to  imagine  a  system  where  one  never  knew  for  sure  whether  a  transaction  was 
completed  -  the  notion  of  a  validation  or  commit  point  seems  inescapable. 

On the other hand, maintaining the validated transaction history serializable in validation
order is an efficiency constraint. Note that serializability, as defined here, depends only on
the ordering of transaction numbers, and not on the numbers themselves. Thus, if
transaction a requests validation, and the current validated transaction history consists of the
transactions numbered 1, 2, 3, ..., n, then a could conceivably be validated under transaction
number 1.5, 2.5, 3.5, etc. However, this would require maintaining read and write sets for
completed transactions, would probably prove excessively time-consuming, and could require
(depending on the design) query validation as well. Therefore, such schemes are rejected
here. It seems that in all existing or proposed concurrency controls in which transactions
[...] manager commit apparently have previously been confused -- in fact, in many systems, they
are (accidentally) the same, probably due to the lack of separation of CC design from RcvM
design. The CC commit point is described above. A RcvM commit point is quite different:
assuming a memory hierarchy, a number of RcvM commit points at different levels can be
defined as points at which, if memory at the given level of the hierarchy does not fail, the
writes of transactions that have completed write-phases (say) to that level can be recovered.
Here, it was earlier decided to send the GMM new versions only after (CC) validation, for
efficiency and simplicity reasons, which results in CC commit points preceding RcvM commit
points.

Finally,  the  paradigm  by  no  means  solves  the  general  concurrency  control  problem,  since  the 
scope  of  concurrency  control  policy  design  is  so  large.  For  example,  much  of  the  research 
in  distributed  concurrency  control  design  can  be  seen  as  the  following  problem:  design  the 
policy  so  that  a  decision  regarding  an  access  request  for  an  object  p  can  be  made  using 
only  information  that  is  present  at  the  node  where  p  is  stored.  Another  large  area  of 
research  in  policy  design,  now  made  possible  with  the  above  general  paradigm,  is  that  of 
designing  a  policy  module  that  selects  the  optimal  type  of  concurrency  control  based  on, 
say,  performance  monitoring  or  usage  statistics.  This  is  made  possible  by  the  above 
paradigm  since  the  policy  module  need  not  be  restricted  to  any  one  type  of  policy;  whatever 
decisions are made by the policy module, the validated transaction history still remains
serializable. For example, a policy module that selected options at random would still be a
valid policy module, in terms of guaranteeing serializability.

4.6.  Partitioning  the  Concurrency  Control 

In practice, it is often the case that possible conflicts between transactions arise rarely (see
[Kung and Robinson 81] for a discussion of systems where this is likely to hold). In these
cases, most of the work done by the CC is simply checking for each access request for
some object that the read or write set for that object is empty, and after the request has
been granted, updating the read or write set for that object. In the case that the CC forms a
system bottleneck, this suggests the following scheme for introducing parallelism in the CC:
partition the set of shared data objects in some fashion (for example, by mapping each
object with ID ID into partition number ID mod n, where there are n partitions), and use a
separate process to manage the read and write sets for each partition. In a computer
network application, if objects are partitioned based on the node where the transaction that
created the object originated, the result is a type of primary site approach.
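
The ID mod n mapping can be illustrated with a short sketch. It assumes integer object IDs and, for brevity, keeps every partition in one process as a plain dictionary; in the scheme described above each partition would be managed by its own process. The names partition_of and PartitionedSets are hypothetical.

```python
def partition_of(object_id: int, n: int) -> int:
    """Map an object ID to a partition number using ID mod n."""
    return object_id % n


class PartitionedSets:
    """Read and write sets for shared objects, split into n partitions."""

    def __init__(self, n: int):
        self.n = n
        # One table per partition: object ID -> (read set, write set).
        self.parts = [dict() for _ in range(n)]

    def sets_for(self, object_id: int):
        """Return the (read set, write set) pair, creating it if absent."""
        part = self.parts[partition_of(object_id, self.n)]
        return part.setdefault(object_id, (set(), set()))
```

An access request for an object touches only the table of that object's partition, so requests for objects in different partitions need no coordination over read and write sets.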

The remaining information managed by the concurrency control is transaction information:
the status of each running transaction, the set of objects accessed for each transaction
[...] Alternatively, several processes could be used to manage transaction information (for
example, the same processes that manage read and write sets), and all transaction
information could be stored in shared memory (in this case, access to transaction
information would have to be synchronized, but designing the synchronization mechanism
does not present any fundamentally new problems).

In the case of computer networks, a large number of schemes have been proposed in which
transaction information is distributed over the network (typically, if a conflict develops
between two transactions due to an access request for some object, this information is
stored at the node where the object is stored). The use of a central transaction number
counter can be avoided by determining the order in which transactions are possibly validated
beforehand (by assigning timestamps at the beginning of transactions, for example -- if the
timestamp of a is less than that of b, in order to validate both a and b, a must be validated
before b). If multi-version objects are used, the same ordering must be realized in version
numbers. The main problems with these approaches are that it is more difficult to be sure
that the system is correct (the predetermined ordering is often used in an attempt to cause
identical decisions to be made at different nodes without communication, and so the
correctness argument depends on a priority scheme), and often there are no resulting
performance advantages (see [Garcia-Molina 79]). However, if the network is geographically
distributed (with nodes in different cities, for example) and there is locality of reference
(transactions almost always access objects stored at the node where the transaction
originates), there are clear advantages to these approaches. In such cases, if the possible
validation ordering is determined beforehand, the paradigm can be made to apply by adding
the following restriction: a -> b, a ->v b, or a ->c b can be chosen as an option only if a
precedes b in the predetermined ordering.
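
As a small illustration of this restriction, the check below admits an option only when it agrees with the predetermined ordering. The ordering key used here, a (start timestamp, node ID) pair that breaks timestamp ties between nodes, is an assumption made for the sketch and is not specified in the text; may_choose is a hypothetical name.

```python
def may_choose(option, a, b, order):
    """Return True if relating a to b by `option` respects the fixed ordering.

    option: one of "->", "->v", "->c", read as "a option b".
    order:  maps each transaction to a sortable key; here a hypothetical
            (start timestamp, node ID) pair so that keys are unique.
    """
    if option not in ("->", "->v", "->c"):
        raise ValueError(f"unknown option: {option}")
    return order[a] < order[b]
```

A policy module would apply this test before offering any of the three options, falling back to the abort options when the test fails.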

5. Basic Policies

In  this  chapter  a  set  of  basic  policies  is  defined.  It  will  be  seen  that  the  design  space  for 
concurrency controls is much larger than has previously been recognized -- even in the
simplest  case  in  which  all  transactions  are  handled  uniformly  there  is  a  large  number  of 
distinct  policies.  This  set  of  policies  should  be  considered  only  as  a  foundation  for  policy 
development,  since  in  practice  there  are  many  valuable  extensions.  Two  extensions  to 
policies  using  scheduling,  deadlock  detection  and  queuing  of  requests,  are  described.  Other 
extensions  involve  the  introduction  of  priority  schemes,  as  described  in  Chapter  4.  As 
examples  of  how  one  might  begin  to  develop  a  policy,  possible  philosophies  behind  several 
policies  are  discussed. 

5.1. Definition of the Basic Policies

A  basic  policy  is  defined  here  as  a  policy  in  which  all  transactions  are  handled  uniformly 
without  the  use  of  priorities.  For  each  type  of  request,  the  paradigm  of  the  previous  chapter 
defines  a  set  of  transactions  that  may  conflict  with  the  transaction  issuing  the  request:  for  a 
read request for p, all transactions b with W(b, p) or WP(b, p); for a write request for p, all
transactions b with R(b, p) or RP(b, p); for a read/write request for p, all transactions b with
W(b, p), WP(b, p), R(b, p) or RP(b, p); and for a validation request from a, all transactions b
with b -> a. In each case, these transactions will be called here simply the conflicting
transactions. Given a request, if the set of conflicting transactions is non-empty, the policy
must be consulted. For a basic policy, all conflicting transactions are treated uniformly. A
number of basic policies can be obtained by choosing one of the following options for each
type of request (the "kill/die" terminology is taken from [Rosenkrantz et al 78]).

wait - have the requesting transaction wait on all conflicting transactions, using
->v or ->c scheduling as given by the paradigm.

kill - abort all conflicting transactions.

die - abort the requesting transaction.

grant - grant the access request (not an option for a validation request).

For a read/write request from a for p, if the wait or kill option is used, one might want to
handle transactions b with R(b, p) or RP(b, p) but not W(b, p) nor WP(b, p) separately from
transactions b with W(b, p) or WP(b, p). For example, transactions b with R(b, p) or RP(b, p)
but not W(b, p) nor WP(b, p) could be ignored at this point, and possibly be waited on at the
validation point if the possible conflict does not turn into a real conflict, while transactions
with W(b, p) or WP(b, p) might be waited on at the current point. To handle these schemes,
three sub-options may be added to the wait or kill options in the case of a read/write
request as follows.

read - wait on or abort only those transactions that have issued a conflicting read
request but have not issued a conflicting write request.

write - wait on or abort only those transactions that have issued a conflicting write
request.

all - wait on or abort all conflicting transactions.

Using these sub-options, 4x4x(2x3+2)x3 = 384 basic policies have been defined. However,
there is some redundancy: if the grant option is never used, and the all sub-option is used
for a read/write request, there will never be any conflicting transactions at the validation
point, and so the validation option is unused. Eliminating this redundancy, 384 - 3x3x3x2 =
330 distinct basic policies have been defined. In the Cm* system described in Chapter 7, a
policy module was used that provided all of these policies.
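
The two counts can be checked by enumeration. The sketch below is only a verification aid with hypothetical names; to match the 3x3x3x2 term it assumes that the redundant case covers a read/write option of wait all, kill all, or die, since die has no sub-options and also leaves no conflicting transaction behind.

```python
from itertools import product

SIMPLE = ["wait", "kill", "die", "grant"]            # read and write requests
READ_WRITE = [f"{opt} {sub}"
              for opt in ("wait", "kill")
              for sub in ("read", "write", "all")] + ["die", "grant"]
VALIDATION = ["wait", "kill", "die"]                 # grant is not an option


def canonical(read, write, read_write, validation):
    """Collapse policies whose validation option can never be consulted."""
    validation_unused = (
        "grant" not in (read, write, read_write)
        and read_write in ("wait all", "kill all", "die")
    )
    return (read, write, read_write, None if validation_unused else validation)


all_policies = list(product(SIMPLE, SIMPLE, READ_WRITE, VALIDATION))
distinct_policies = {canonical(*p) for p in all_policies}

print(len(all_policies))       # 4 x 4 x (2 x 3 + 2) x 3
print(len(distinct_policies))  # after removing the redundant combinations
```

Of the 3x3x3 = 27 combinations that leave the validation option unused, each appears three times in the full enumeration, so two of every three are redundant, which is where the 3x3x3x2 term comes from.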

One might also consider policies in which active and pending or postponed and
non-postponed transactions are differentiated to be basic policies, although these could also be
considered priority schemes. In any case, sub-options to differentiate among classes of
transactions can be added to the above scheme for basic policies, further increasing the
number of policies.

5.2.  Deadlock  Detection 

Policies  that  use  wait  options  may  avoid  deadlock  by  the  use  of  a  priority  scheme,  at  the 
expense of increased probability of aborts (e.g., see [Rosenkrantz et al 78]). For the basic
policies defined above, though, deadlock is possible. In this section the deadlock detection
scheme used in the Cm* system will be described. Some alternatives to this scheme are to
"not worry" about deadlock (relying on timeouts to abort transactions, for example), or to
periodically  check  the  wait  relation  for  cycles  (see  [Gray  78]). 

The scheme used in the Cm* system was to schedule b ->v a or b ->c a only if it was not
the case that a ->* b, where ->* is the transitive closure of the union of the ->v and ->c
relations. If this could not be done, the requesting transaction was aborted. In this way the
union of the ->v and ->c relations was maintained as a partial ordering. In order to
determine if a ->* b, the following simple recursive procedure was used.

1. If a ->v b or a ->c b, then a ->* b -- return true.

2. For each transaction c such that a ->v c or a ->c c: if c ->* b, then
a ->* b -- return true.

3. Otherwise, it is not the case that a ->* b -- return false.
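
The three steps translate directly into a short recursive function. This is a sketch rather than the Cm* code: it assumes the union of the ->v and ->c relations is stored as a dictionary from each transaction to the set of its direct successors, and the names precedes and try_schedule are hypothetical.

```python
def precedes(a, b, waits):
    """Return True if a ->* b.

    waits maps x to the set of y with x ->v y or x ->c y (assumed layout).
    No visited set is kept: the recursion terminates because the relation
    is maintained as a partial ordering, so it never contains a cycle.
    """
    successors = waits.get(a, set())
    if b in successors:                                     # step 1
        return True
    return any(precedes(c, b, waits) for c in successors)   # steps 2 and 3


def try_schedule(b, a, waits):
    """Schedule b ->v a or b ->c a unless a ->* b already holds.

    Returns False when scheduling would create a cycle, in which case the
    requesting transaction is to be aborted by the caller.
    """
    if precedes(a, b, waits):
        return False
    waits.setdefault(b, set()).add(a)
    return True
```

Because try_schedule refuses any edge that would close a cycle, the acyclicity that precedes relies on is preserved by construction.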

At  this  point  a  modification  that  may  be  made  to  the  basic  policy  wait  option  above  can  be 
described.  Consider  the  following  example:  a  requests  write  access  to  p,  and  the  request  is 
postponed so that WP(a, p); next, b requests read access to p, and a ->c b is scheduled.
Later, when the write request from a is re-processed, b will be a conflicting transaction since
RP(b, p), but scheduling b ->v a would lead to deadlock, and so a is aborted. This does not
seem to make sense in terms of a policy: if a really should be aborted due to the access
request from b, why not abort a at the time the access request is received? For this reason,
in the Cm* policy module, when processing a request from a for p, all transactions b such
that a ->v b or a ->c b and RP(b, p) were removed from the set of conflicting transactions
in the case that a wait option was selected. It should be clear by now that any modification
such as this does not affect fundamental correctness -- this is one of the main strengths of
the  paradigm.  In  this  case,  the  result  is  a  queueing  structure  on  objects  in  which,  for  each 
object,  a  number  of  readers  can  be  waiting  for  a  writer  which  can  in  turn  be  waiting  on  a 
number  of  readers,  etc.,  on  a  first-come  first-served  basis.  Of  course,  other  schemes  could 
be  used.  For  example,  a  type  of  reader-priority  scheme  results  if,  on  a  read  request  from  a 
for  p,  transactions  b  with  WP(b,  p)  are  removed  from  the  set  of  conflicting  transactions 
before  applying  the  wait  option. 
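
Both modifications amount to filtering the set of conflicting transactions before the wait option is applied. A minimal sketch follows, using hypothetical function names and assuming the wait relations are stored as a dictionary from a transaction to the set of transactions scheduled after it.

```python
def queueing_filter(a, conflicting, waits, pending_readers):
    """Drop every b that already waits on a and has a postponed read.

    waits:            maps x to the set of y with x ->v y or x ->c y
    pending_readers:  transactions b with RP(b, p)
    """
    after_a = waits.get(a, set())
    return [b for b in conflicting
            if not (b in after_a and b in pending_readers)]


def reader_priority_filter(conflicting, pending_writers):
    """On a read request, ignore transactions b with WP(b, p)."""
    return [b for b in conflicting if b not in pending_writers]
```

Neither filter can affect serializability, since each only changes which of the paradigm's permitted options is applied to which transaction.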

5.3.  Some  "Interesting"  Policies 

The two-phase locking policy is obtained by selecting the wait option for read and write
requests, and the wait all option for read/write requests -- the validation option is redundant.
As noted earlier, the goal of this policy is to avoid aborts if at all possible.

An  optimistic  policy  is  obtained  by  selecting  the  grant  options  for  read,  write,  and  read/write 
requests, and the kill option for validation requests. In this policy, transactions never wait,
and  conflicting  transactions  "race"  to  the  finish:  given  a  set  of  conflicting  transactions,  the 
transaction  that  first  requests  validation  completes,  and  the  conflicting  transactions  are 
aborted. 

There are a variety of policies that lie between the two-phase locking and optimistic policies.
In these policies, combinations of wait, grant, die, and kill options are used. For example,
two-phase locking could be modified so as to grant a read request from a for p even if for
some b, W(b, p) -- although this introduces a -> b, a is allowed to proceed immediately, and
it may still be possible to validate both transactions, having b wait on the validation of a
when b requests validation if this case arises (and if this does not cause deadlock). The
options for this policy would be: read - grant, write - wait, read/write - wait all, validation -
wait.

Similarly, the optimistic policy could be modified so that the wait option is selected for a
validation request from a with respect to all transactions b -> a -- the philosophy behind this
policy might be to retain the "never-wait" property of the optimistic policy for the
read-phases of transactions, but to wait if necessary at the validation point in order to determine if
possible conflicts turn into true conflicts. The options for this policy are: read, write,
read/write - grant, validation - wait.

Finally, policies that select only kill or die options may seem uninteresting, but such policies
could conceivably prove useful in some applications due to their extreme simplicity: for
example, the ->v, ->c, and -> relations are unused, and so need not be maintained.
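
The policies of this section can be written down as small option tables, one entry per request type. The table layout, the policy names other than two-phase locking and optimistic, and the two helper functions are hypothetical; None marks the validation option that is redundant under two-phase locking.

```python
POLICIES = {
    "two-phase locking": {
        "read": "wait", "write": "wait",
        "read/write": "wait all", "validation": None,
    },
    "optimistic": {
        "read": "grant", "write": "grant",
        "read/write": "grant", "validation": "kill",
    },
    "two-phase locking, reads granted": {
        "read": "grant", "write": "wait",
        "read/write": "wait all", "validation": "wait",
    },
    "optimistic, wait at validation": {
        "read": "grant", "write": "grant",
        "read/write": "grant", "validation": "wait",
    },
}


def can_wait(policy):
    """True if any request type uses a wait option (deadlock is possible)."""
    return any(opt is not None and opt.startswith("wait")
               for opt in policy.values())


def needs_no_relations(policy):
    """True if only kill or die options are used, so that the ->v, ->c,
    and -> relations are never recorded."""
    return all(opt in ("kill", "kill all", "die") for opt in policy.values())
```

For instance, can_wait is false for the optimistic policy, so deadlock detection could be omitted for it, while needs_no_relations holds only for tables built entirely from kill and die.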

6. Global Memory Managers

In this chapter the transaction support, query support, and garbage collection functions of
GMMs will be described.

6.1. Memory Management

The GMM must allocate and de-allocate storage space for objects, maintain the mapping
between the virtual description (ID and version number) of an object and its physical
address(es), and find certain versions of an object given its ID. Since none of these
problems seem significantly new (in fact, several existing multi-version file systems solve
these problems), discussion of such a multi-version object system will be omitted here.

6.2.  Transaction  Support 

During the read-phase of a transaction, upon receiving Tread, the GMM must find the most
recent version of the object requested. If the version number is different from that of the
local copy (if any), the LMM will then read the new version of the object from shared
memory. That the transaction "sees" a consistent database must be ensured by the CC.
The GMM must also attempt to claim space in shared memory for new objects and new
versions of objects, as requested.

During the write-phase, the GMM updates the mapping from virtual descriptions to physical
addresses of each new version or new object as it is written. Then, upon receiving Tend,
the GMM will update a write-phase completion list (WPCL), and possibly update a
write-phase completion counter (WPCC). The WPCC is defined as the largest transaction number
such that the corresponding transaction and all lesser-numbered transactions have
completed their write-phases; the WPCL is defined as the list of transaction numbers greater
than or equal to the WPCC of all such transactions that have completed their write-phases.
This should be made clear by the following example.

Assume that all transactions numbered 1093 and less have completed their
write-phases, and that the transactions numbered 1095, 1096, and 1098 have also
completed their write-phases. Then the WPCC is currently 1093, and the WPCL is:

1093, 1095, 1096, 1098.

Continuing the example, upon receiving Tend for the transaction numbered 1100,
the WPCL becomes:

1093, 1095, 1096, 1098, 1100,

and the WPCC remains unchanged. However, upon receiving Tend for the
transaction numbered 1094, the WPCC becomes 1096, and the WPCL becomes:

1096, 1098, 1100.

The WPCC and WPCL are of use in query support and garbage collection, as discussed
below, and in recovery, as mentioned in Chapter 3.
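
The WPCC/WPCL bookkeeping defined in Section 6.2 can be sketched in a few lines of Python.
This is only an illustrative sketch, not the implementation used in this work: the class name
WritePhaseTracker and the method name tend are hypothetical, and the sketch assumes that
transaction numbers are consecutive integers, each of which eventually completes a write-phase.

```python
class WritePhaseTracker:
    """Tracks the WPCC and WPCL as Tend messages arrive (illustrative only)."""

    def __init__(self, wpcc):
        self.wpcc = wpcc          # all transactions numbered <= wpcc are complete
        self.completed = set()    # completed transaction numbers greater than wpcc

    def tend(self, txn):
        """Record that the write-phase of transaction `txn` has completed."""
        if txn <= self.wpcc or txn in self.completed:
            raise ValueError(f"transaction {txn} already recorded as complete")
        self.completed.add(txn)
        # Advance the counter across any now-contiguous run of completions.
        while self.wpcc + 1 in self.completed:
            self.wpcc += 1
            self.completed.remove(self.wpcc)

    @property
    def wpcl(self):
        """The WPCC followed by the completed numbers above it, ascending."""
        return [self.wpcc] + sorted(self.completed)


tracker = WritePhaseTracker(1093)
for txn in (1095, 1096, 1098, 1100):
    tracker.tend(txn)
print(tracker.wpcc, tracker.wpcl)   # 1093 [1093, 1095, 1096, 1098, 1100]
tracker.tend(1094)
print(tracker.wpcc, tracker.wpcl)   # 1096 [1096, 1098, 1100]
```

Keeping only the completed numbers above the counter means the structure stays small as long
as write-phases do not lag far behind one another.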

6.3.  Query  Support 

Suppose, as a query begins, the WPCL is:

1096, 1098, 1100.

At this point, a consistent version of the database can be observed, without any concurrency
control, by accessing for each object ID the greatest version of the object that is less than or
equal to 1096. This is the most recent version of the entire database that can be guaranteed
to be consistent since, given an object ID, whether or not there may later be a version 1097,
1099, or a version greater than 1100 of this object, cannot now be determined.

By associating the current value of the WPCC with each query as its query number (QN)
upon Qbegin, and thereafter sending that query the greatest version of each object
requested that is less than or equal to its QN, queries will always observe a consistent
database, without any CC support.
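
The version-selection rule just described can be sketched as follows. The function name
visible_version is hypothetical, and the sketch assumes that the version numbers of an object
are available as a list in ascending order; how the GMM actually stores them is not specified
here.

```python
from bisect import bisect_right


def visible_version(versions, qn):
    """Return the greatest version number <= qn, or None if there is none.

    `versions` is assumed to be a list of version numbers in ascending order.
    """
    i = bisect_right(versions, qn)
    return versions[i - 1] if i > 0 else None


print(visible_version([1001, 1095, 1097], 1096))   # 1095
print(visible_version([1098, 1100], 1096))         # None
```

A result of None corresponds to an object that did not yet exist in the database state
observed by the query.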

A  possible  problem  with  this  scheme  can  be  illustrated  by  the  following  example. 

A user executes a transaction, the transaction is successful and is numbered 1098,
and when its write-phase completes the WPCL becomes:

1096, 1098, 1100,

with a WPCC of 1096. But now, if the same user executes a query before the
write-phase of the transaction numbered 1097 completes, the effect of the user's previous
transaction (numbered 1098) will not be visible!

This problem arises only because write-phases are allowed to take place asynchronously, which
is highly desirable for efficiency in the kinds of multiprocessor/network applications of
concern here. If the above example represents a true problem, one solution is to restrict
write-phases to be sequential in transaction-number order, which may be acceptable in a
centralized system.

In decentralized systems, though, other alternatives are more attractive. A scheme involving
asynchronous notification of application programs of the occurrence of certain events (such
as WPCC ≥ 1098) is feasible. Another solution is to allow queries to "pick" their own QN -
in the example above, the query could pick the transaction number of the completed
transaction, 1098, as its QN. Then, the GMM could be designed so as to reply to Qbegin,
but postpone the reply until the WPCC became greater than or equal to the QN of a given
query, and the LMM could be designed so as to wait for such a reply. However, in order for
garbage collection (see below) to be correct, queries must not be allowed to pick QNs less
than the WPCC.
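
The "pick your own QN" alternative can be sketched as a small gate in front of Qbegin. This
is a hedged, single-threaded illustration only: the names QueryGate, qbegin, and wpcc_advanced
are hypothetical, message passing is replaced by return values, and the interaction with
min QN for garbage collection is not modeled.

```python
class QueryGate:
    """Postpones the Qbegin reply until the WPCC reaches the query's chosen QN."""

    def __init__(self, wpcc):
        self.wpcc = wpcc
        self.waiting = []   # (qn, query_id) pairs whose reply is postponed

    def qbegin(self, query_id, qn=None):
        """Return the QN if the reply can be sent now, else None (reply postponed)."""
        if qn is None:
            qn = self.wpcc              # default: use the current WPCC
        if qn < self.wpcc:
            raise ValueError("a QN less than the WPCC is not allowed")
        if qn == self.wpcc:
            return qn
        self.waiting.append((qn, query_id))
        return None

    def wpcc_advanced(self, new_wpcc):
        """Record a new WPCC and return the ids of queries whose reply is now due."""
        self.wpcc = new_wpcc
        due = [qid for qn, qid in self.waiting if qn <= new_wpcc]
        self.waiting = [(qn, qid) for qn, qid in self.waiting if qn > new_wpcc]
        return due


gate = QueryGate(1096)
print(gate.qbegin("q1"))            # 1096
print(gate.qbegin("q2", qn=1098))   # None (reply postponed)
print(gate.wpcc_advanced(1097))     # []
print(gate.wpcc_advanced(1098))     # ['q2']
```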

Finally, there is another alternative, in which the GMM associates a copy of the current
WPCL with each query as the query begins. In the example above, the list 1096, 1098, 1100
could be associated with the query. In this alternative the GMM finds for each read request
the greatest version of the object that either (1) is less than the minimum transaction number
in the associated WPCL copy, or (2) appears in the associated WPCL copy. This provides a
consistent view of the database since, referring to the example above, the CC has
guaranteed that there are no conflicts between the transactions numbered 1098 or 1100 and
any lesser-numbered transaction.
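
The two-part selection rule of this last alternative can be sketched as below. The function
name visible_version_wpcl is hypothetical, and the sketch simply takes the version numbers of
one object and the WPCL copy as plain Python collections.

```python
def visible_version_wpcl(versions, wpcl_copy):
    """Greatest version that is below min(wpcl_copy) or listed in wpcl_copy.

    Returns None if no version of the object qualifies.
    """
    floor = min(wpcl_copy)
    listed = set(wpcl_copy)
    eligible = [v for v in versions if v < floor or v in listed]
    return max(eligible, default=None)


wpcl_copy = [1096, 1098, 1100]
print(visible_version_wpcl([1001, 1095, 1097, 1098], wpcl_copy))   # 1098
print(visible_version_wpcl([1001, 1097, 1099], wpcl_copy))         # 1001
```

In the first call version 1097 is skipped because the write-phase of transaction 1097 had not
completed when the query began, while version 1098 is visible because 1098 is in the list.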

6.4.  Garbage  Collection 

Each time a new version of an object is created, the immediately lesser-numbered version of
the object becomes potential garbage -- "potential" garbage since there may currently be
queries executing that will need to access this version. Whether or not an object is "true"
garbage can be determined by the current minimum value of all QNs, say, min QN. If this
number is greater than or equal to the version number of the new version of an object, the
preceding version can then be garbage-collected, since all current and future queries will
now access versions of this object with version numbers greater than or equal to that of the
new version.

Note: in the case that WPCL copies are associated with queries, then taking the QN of a
query to be the minimum of its associated WPCL copy, the above reasoning still applies.

Garbage can be collected as soon as it is generated by maintaining a garbage list (GL) as
follows. The garbage list is a list of (version number, version set) pairs, where a version set
is a set of (object ID, version number) pairs -- i.e., each element of a version set refers to a
particular version of a particular object. If a new version, with say version number NV, of the
object with ID ID is written, and OV is the version number of the preceding version
(assuming there is one), then the GL is updated by adding (ID, OV) to the version set in the
GL associated with version number NV (creating a new version set if necessary). In the case
that the new version is a deleted version, (ID, NV) is also added to this set. Potential
garbage objects can now be collected as soon as they become true garbage by freeing all
objects in the version sets of the garbage list associated with version numbers NV less than
or equal to min QN, for each new value of min QN. Finally, min QN can be continuously
updated by recalculation upon each query completion, or it could be periodically updated.
An example is as follows.

Let the GL currently be

(1096, {(1,1095), (2,1001)}), (1097, {(1,1096), (4,996)}), (1098, {(3,1001)}),

and let min QN = 1095. Now, if version 1097 of the object with ID 5 is written, and
the largest numbered version of this object was previously 1050, the GL becomes:

(1096, {(1,1095), (2,1001)}), (1097, {(1,1096), (4,996), (5,1050)}), (1098, {(3,1001)}).

Later, after all queries with QN = 1095 have completed, min QN increases, to 1096
say (i.e., assume there is still an uncompleted query with QN = 1096). Then, after
garbage-collecting versions 1095 of object 1 and 1001 of object 2, the GL becomes:

(1097, {(1,1096), (4,996), (5,1050)}), (1098, {(3,1001)}).
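
The garbage list maintenance described in this section can be sketched in Python, with values
modeled on the example above. This is an illustrative sketch under stated assumptions rather
than the actual GMM code: the class name GarbageList and its methods are hypothetical, the GL
is held as a dictionary keyed by NV, and "freeing" a version is represented by returning it
to the caller.

```python
class GarbageList:
    """Maps each new version number NV to its set of (object ID, version) pairs."""

    def __init__(self):
        self.version_sets = {}

    def record_write(self, obj_id, nv, ov=None, deleted=False):
        """Record that version `nv` of `obj_id` was written, superseding `ov`."""
        entries = set()
        if ov is not None:
            entries.add((obj_id, ov))    # the preceding version is potential garbage
        if deleted:
            entries.add((obj_id, nv))    # a deleted version is itself potential garbage
        if entries:
            self.version_sets.setdefault(nv, set()).update(entries)

    def collect(self, min_qn):
        """Remove and return every entry whose NV is <= min_qn (true garbage)."""
        freed = set()
        for nv in [n for n in self.version_sets if n <= min_qn]:
            freed |= self.version_sets.pop(nv)
        return freed


gl = GarbageList()
for obj_id, nv, ov in [(1, 1096, 1095), (2, 1096, 1001), (1, 1097, 1096),
                       (4, 1097, 996), (3, 1098, 1001)]:
    gl.record_write(obj_id, nv, ov)
gl.record_write(5, 1097, ov=1050)
print(sorted(gl.collect(1096)))   # [(1, 1095), (2, 1001)]
print(sorted(gl.version_sets))    # [1097, 1098]
```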

6.5.  Partitioning  the  Global  Memory  Manager 

Parallelism can be introduced in the GMM by partitioning the set of shared data objects in
some fashion (for example, if there are several secondary memory devices, and if the constraint
is made that all versions of an object must be stored on the same device, an object can be
mapped to a partition corresponding to the device on which the object's versions are
stored), and by using a process for each partition to manage storage allocation and the
(ID, version) → physical address mapping for objects in that partition. Furthermore, the GL
can be partitioned in the same fashion: if object p belongs to partition i, potential garbage
versions of p are recorded in garbage list GL_i, say. Each GL can be managed by a separate
process, with a message interface between mapping and GL processes, or if the mapping
and GL partition schemes are identical, the mapping processes can also perform garbage
collection. In order to balance free storage among the mapping processes, an additional
process could be used to generate new object IDs, with IDs chosen in such a fashion that
each newly created object maps to that partition containing the most free storage.
Alternatively, for computer network applications for example, the LMM could always first try
object creation via a mapping process that was "close" in the network, with other mapping
processes used if this fails.

The WPCL, WPCC, and QN information are analogous to transaction information for the
CC -- these structures can be managed by a single process, or by several processes
accessing these structures in shared memory.

There  does  not  seem  to  be  any  straight-forward  way  to  distribute  the  WPCL,  WPCC,  and  QN 
information  in  a  way  that  would  offer  any  performance  advantages  --  this  is  because  these 
structures  are  all  intimately  connected  with  the  central  transaction  number  counter  of  the 
CC.  In  the  types  of  proposed  systems  mentioned  in  Section  4.6  in  which  the  use  of  a  central 
transaction  number  counter  is  avoided  by  using  timestamps  or  other  schemes,  these 
structures  are  simply  omitted.  The  results  are  that  queries  must  be  controlled  by  the  CC 
(however,  by  having  queries  access  sufficiently  old  versions  of  the  database,  queries  will 
almost  never  be  aborted),  and  that  there  do  not  seem  to  be  any  algorithms  for  garbage 
collection  other  than  heuristic  techniques  (for  example,  it  might  be  assumed  that  any 
potential garbage version more than a day old could be deleted).

7. Transaction Processing on Cm*

In  order  to  (1)  develop  algorithms  for  the  concurrency  control  designs  previously  presented, 
(2)  experimentally  verify  the  correctness  of  these  algorithms,  (3)  investigate  the  limitations  of 
multiprocessor  systems  for  transaction  processing,  and  (4)  demonstrate  the  usefulness  of  the 
policy  module  approach  for  policy  experimentation  on  a  complex  system,  a  transaction 
processing system was implemented on Cm*/Medusa. In this chapter this system is
described, the results of the experiments are presented, and some implications of these
results are discussed. The concurrency control algorithms are given in Appendix II.

7.1. Overview of Cm*/Medusa

Cm*, a distributed multi-microprocessor designed and built at Carnegie-Mellon University
(see [Swan et al 77]), currently consists of 50 computer modules (Cms) and five
communication controllers (Kmaps), as shown in Figure 7.1. Each Cm consists of a DEC
LSI-11 microprocessor, primary memory of 64K or 128K bytes, various devices, and a local
switch (Slocal). The Slocal contains relocation tables that allow each memory reference to
be mapped either to memory or devices on the associated LSI-11 bus (a local reference) or
to be passed to the Kmap for the cluster (a non-local reference). Each Kmap is a
microprogrammable microprocessor specially designed as a communication controller, and is
responsible for mapping non-local memory references either to another Cm in the same
cluster (an intracluster reference), or through another Kmap to a Cm in a different cluster
(an intercluster reference). However, because the Kmaps are microprogrammable, this
mapping can take place in many different and complex ways. In particular, it is possible to
microprogram key operating system communication primitives in the Kmaps.

Medusa (see [Ousterhout et al 80]) is one of two operating systems designed and
implemented for Cm* (the other is StarOS - for more detailed descriptions of Cm*, Medusa,
and StarOS, along with a variety of information regarding Cm*-related research, see the
research review [Jones and Gehringer 80]). The two primary uses of the Kmaps under
Medusa are for message communication and address mapping - these functions are
implemented in the Kmap microcode. Message communication in Medusa takes place using
objects called pipes, and is an extension of the Unix pipe mechanism (see [Ritchie and
Thompson 74]). Here, it need only be noted that the extensions are such that the
assumptions of Section 3.1 regarding communication between subsystems can be satisfied
using mechanisms already provided.

Medusa provides a structure called a task force to implement operating system functions and
user programs. A task force is a collection of activities (or processes), each of which can
reference a distinct collection of private objects, such as code pages, and all of which can
reference a single collection of shared objects, such as communication pipes or shared data
pages. Access to an object is gained through a descriptor list; thus, for each task force
there is a shared descriptor list (SDL), and for each activity there is a private descriptor list
(PDL). This task force structure is shown in Figure 7.2. The mapping of an access to an
object through a descriptor list is supported by the Kmap microcode; in particular,
descriptors are cached in the Kmap, so that access to a non-local object can usually
proceed without the extra step of fetching a descriptor from a descriptor list. In the case
that the object is local to the activity accessing the object, a simple access (i.e., an access
produced by an LSI-11 instruction) can proceed directly through the Slocal without involving
the Kmap. Because of this, all code and private data pages are typically made local (if
possible) for performance reasons.

Medusa  supports  various  operating  system  functions  through  a  collection  of  task  forces 
called  utilities.  Utilities  are  special  kinds  of  task  forces  in  that  among  other  differences,  a 
descriptor  list  of  input  pipes  to  the  utilities,  called  the  utility  descriptor  list  (UDL),  is  stored  in 
each  Cm.  Thus,  any  activity  running  on  any  Cm  can  invoke  an  operating  system  function  by 
sending a message to the utility implementing that function through the appropriate pipe in
the UDL. Of the utilities provided by Medusa, the only one that was used during the course
of the experiments described below was the file system (other utilities were used during
startup and after the completion of experiments). The file system utility handles all input and
output  devices,  and  provides  a  hierarchical  file  structure. 

7.2.  The  Transaction  Processing  System 

The transaction processing system was implemented as a single task force of eleven
activities: a master activity, eight transaction-processor activities (TP1, TP2, ..., TP8), a CC
activity, and a GMM activity. The SDL/PDL structure of this task force is shown in Figure
7.3. For the shared Cm memory system (see below), the SDL shown in the figure was
extended to include 48 descriptors for shared Medusa (4096-byte) pages.

All experiments took place using a three-cluster partition of the system. In each case, all
activities (including utility activities) were allocated their own Cm, and all code, stack, and
data pages were local. Since Medusa did not support context swaps, activities were always
resident in their respective Cms. The data objects supported at the virtual internal level were
512-byte pages.

The CC activity implemented all functions needed by the CC paradigm, and a policy module
providing all basic policies was used, with deadlock detection and request queueing
extensions as described in Section 5.2. At the start of each experiment a policy was chosen
by sending the CC activity a message containing the options to be used by the policy
module. For the policy experiments, the following four policies were used.

locking: read, write, read/write - wait (the validation option is redundant),
lock-opt: read - grant, write - wait, read/write - wait all, validation - wait,
opt-lock: read, write, read/write - grant, validation - wait,
optimistic: read, write, read/write - grant, validation - kill.

Figure 7.3. Transaction Processing System Structure

The GMM activity used the first scheme for query support of Chapter 6. The (ID, version)
→ address mapping used by the GMM, along with the WPCL, GL, QN table, etc., were
stored in primary memory.

The structure of the transaction processor activities was as follows.

transaction/query generator - generate random insertions, deletions, and queries.

RcdM - implement a file structure based on a collection of 512-byte pages as
shown in Figure 7.4 - this RcdM attempted to optimize storage use by packing new
records into existing record pages [...].

LMM - local memory manager - all local memory that was available after allocation
for code, data, and stack was used as LMM cache space, with a resulting
cache size of 42 512-byte pages (this space was also used for write-phase support),
and the page-replacement policy was LRU (least recently used page replaced first).

Figure 7.4. File Structure Used by Record Manager

[...] and executed random insertions, deletions, or queries to find one tuple, based on messages
from the master activity giving the relation name, operation, and miscellaneous additional
[...]


In all of the experiments, a previously created database consisting of 500 tuples in three
relations was used. Although the RcdM supported variable length tuples, for the experiments
only fixed size tuples were used. The three relations were as follows: relation A had 4
domains, of lengths 10, 10, 3, and 10, with indexes on tuple IDs and the [...] domain;
relation B had 5 domains, of lengths 10, 10, 1, 2, and 10, with indexes on tuple IDs and the
[...]

[...] the contents of the tuple. A query to find one tuple first randomly selected a relation (any
relation equally likely), then selected an index (any index for the relation equally likely), then
selected an index value (any index value equally likely), and then found and retrieved the
first tuple with a corresponding domain value greater than or equal to the key if there was
such a tuple, otherwise the tuple with the maximal value for that domain was retrieved. A
deletion proceeded in the same fashion as a query, except that the retrieved tuple was
deleted. If a transaction failed due to a conflict, the master activity always immediately sent
a message to the transaction processor activity to repeat the transaction. Below, this event
is referred to as a restart.
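
The "find one tuple" rule used by queries and deletions (first tuple with a domain value
greater than or equal to the key, otherwise the tuple with the maximal value) can be sketched
as follows. The function name find_one is hypothetical, and the index is assumed, purely for
illustration, to be a list of (domain value, tuple ID) pairs sorted by domain value; the
RcdM's actual index structure is the one shown in Figure 7.4.

```python
from bisect import bisect_left


def find_one(index, key):
    """Return the first (value, tuple_id) with value >= key, else the maximal entry.

    `index` is assumed to be sorted by value; returns None for an empty index.
    """
    if not index:
        return None
    values = [value for value, _ in index]
    i = bisect_left(values, key)
    return index[i] if i < len(index) else index[-1]


index = [(10, "t1"), (20, "t2"), (20, "t3"), (30, "t4")]
print(find_one(index, 20))   # (20, 't2')
print(find_one(index, 25))   # (30, 't4')
print(find_one(index, 99))   # (30, 't4')
```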

Although this system used artificially generated transactions, it was based on an earlier "real"
system (i.e., usable for applications). This system relied on the unique identification of tuples
to support interactive examination and modification of the database without user interaction
during the course of a query or transaction. Although the only transactions defined were
insertions and deletions, and all defined queries were queries to find a single tuple (in
various ways), the user was generally given the appearance of exclusive interactive access
to  the  database  by  "remembering"  the  state  of  the  user,  in  particular  the  contents  and  the 
unique  identification  of  the  most  recently  accessed  tuple,  between  transactions  and  queries. 
Thus,  the  master  activity  was  designed  to  simulate  to  a  limited  extent  the  behavior  of  a 
number  of  user  interfaces  of  this  type.  In  practice,  the  master  activity  would  be  replaced  by 
a  collection  of  user  interface  activities,  as  shown  earlier  in  Figure  2.5. 

In  addition  to  driving  the  transaction  processor  activities,  the  master  activity  collected  a  trace 
of  the  experiment.  In  order  to  see  what  information  was  collected  during  a  trace,  part  of  a 
trace file is shown in Figure 7.5.

7.3.  Maximum  Throughput  Experiments 

As there are few multiprocessor transaction processing systems in existence, their limitations
are of interest. Using the locking and optimistic policies, experiments were performed to
investigate the maximum throughput as the number of transaction processor activities was
increased. In these experiments 100 insertions or deletions were performed, either equally
likely. Shared memory was accessed by random file access through the Medusa file system.
The master activity, upon receiving a completion message from a transaction processor
activity, always immediately sent a message to begin a new transaction (if there were any
transactions left to perform). The observed throughputs are shown in Figure 7.6.

Separate experiments with the Medusa file system determined that the file system activity
became a bottleneck for this system as the number of transaction processors increased. In
order to see the effects of removing this bottleneck, a new system was developed in which
the unused memory of four 128K Cms was used as shared memory, accessed through the
SDL (the file system was still used by the master activity to read clock values and to write
the trace file). Reads or writes of 512-byte pages were performed using a block move
operation provided by the Medusa Kmap microcode. This system was not believed to be
unrealistic, since it is possible to transfer data to disks at the maximal Kmap block move rate
of approximately 300 512-byte blocks/second. The observed throughputs for this new
system are shown in Figure 7.7.

Figure 7.5. Part of an Experiment's Trace (columns: time, transaction processor number,
command)

These experiments clearly show that significant increases in throughput are possible for
transaction processing using multiprocessor architectures, even when the database is highly
shared. However, there are two limitations on the increases that can be achieved: shared
memory bandwidth and transaction conflict.

With respect to shared memory bandwidth, using the Medusa file system, no increases in
throughput could be achieved with more than four transaction processors. The bottleneck
would have occurred even earlier if objects were not cached in local memory: in the case of
the four transaction processor locking experiment using the file system, out of an average of
14.04 read requests to the GMM, an average of only 5.72 pages had to be read from shared
memory, giving a "cache-hit" ratio of 60% (the meaning of cache-hit here is somewhat
unique, in that once the LMM determines that a local copy is the correct version, no further
messages to the GMM take place). This ratio was typical for all experiments.

Other approaches to reducing the shared memory bandwidth limitation rely on RcdM design.
As noted above, the RcdM used for these experiments attempted to optimize storage by
packing records into "nearby" record pages if possible (this RcdM was earlier developed on
a system in which secondary storage was at a premium). The result is that for some
transactions, a large number of pages were examined (for the four transaction processor
locking experiment using the file system, the maximum number of read requests to the GMM
for any one transaction was 23). A RcdM using a simpler record storage allocation scheme
would encounter the shared memory bottleneck at a later point; however, storage would be
utilized less effectively, and queries could be more expensive. As noted in Appendix I, there
are many alternatives in RcdM design, and it is a field of on-going research.

The  most  direct  way  to  approach  this  limitation,  though,  is  to  increase  shared  memory 
bandwidth.  In  the  experiments  using  shared  Cm  memory,  for  example,  a  dedicated  disk 
controller  could  be  used  on  each  shared  memory  Cm,  with  the  primary  memory  of  the  Cm 
used  as  a  large  buffer.  This  example  illustrates  two  techniques:  provide  more  parallelism  in 
the  path  to  shared  memory  (multiple  disk  controllers),  and  use  intermediate  levels  in  the 
memory  hierarchy  (primary  Cm  memories). 

The  transaction  conflict  limitation  is  more  difficult  to  avoid,  since  it  has  the  effect  of  making 
additional  parallelism  useless:  if  there  is  a  conflict  between  two  transactions,  they  must  (in 
the  general  case)  proceed  sequentially. 

Again, RcdM design plays an important part. If records were not indexed under their
reversed IDs in these experiments, there would have been conflicts between almost every set
of concurrent insertions to the same relation. On the other hand, a RcdM using a simpler
record storage allocation scheme could have had fewer conflicts since the read set sizes
would have been smaller.

There is also a problem at the virtual internal level: if conceptual entities are mapped to
larger internal entities, conflicts can occur between transactions that do not conceptually
conflict. Given the framework of Chapter 2, the only solution is to decrease the granularity
of the objects provided at the internal level. Other approaches rely on introducing
application-dependence into the concurrency control so that larger classes of transaction
histories are allowed (e.g., see [Kung & Papadimitriou 79]), but are beyond the scope of this
work.

The effects of transaction conflict can be seen in Figures 7.6 and 7.7, particularly in the case
of the optimistic policy - for the optimistic policy, when a transaction is validated, all
conflicting transactions are aborted. In these experiments, the degree of concurrency
increased as the number of transaction processors increased, and so the probability of
conflict increased as well. This effect can be seen as an increase in the average number of
restarts for each transaction, as shown in Figures 7.8 and 7.9. Note that restarts occur
much less using the locking policy, since a transaction is aborted only if scheduling its
request would cause deadlock.


Figure 7.9. Number of Restarts using Kmap Block Moves

For the locking policy, increased transaction conflict results in increased waiting due to
scheduling. However, as concurrency increases, there is also increased waiting on shared
system resources (for this system, these are the file system, the Kmaps, and the master,
GMM, and CC activities). The average execution times (including waiting times) for
transactions that were successful on the first attempt under the locking and optimistic
policies are shown in Figures 7.10 and 7.11. Since these curves are almost identical, and
under  the  optimistic  policy  there  is  no  waiting  due  to  scheduling,  it  must  be  concluded  that 
for  this  system  waiting  due  to  scheduling  is  negligible  as  compared  to  waiting  on  shared 
system  resources.  This  explains  why  the  locking  policy  generally  gave  higher  throughput: 
since  waiting  due  to  scheduling  is  negligible,  the  effect  of  restarts  dominates.  In  those  three 
cases  in  which  the  optimistic  policy  gave  slightly  higher  throughput,  there  was  by  chance  a 
combination  of  less  total  work  to  perform  using  the  optimistic  policy  (where  total  work  was 
measured  by  the  total  number  of  shared  page  accesses)  and  a  relatively  small  difference  in 
restarts  for  the  two  policies.  This  can  be  seen  by  comparing  Figures  7.12  and  7.13  with  the 
previous  figures. 

7.4.  Policy  Experiments 

Using  a  policy  module  that  provides  many  policies,  it  is  easy  to  investigate  the  performance 
of policies on a complex system. Using the locking, lock-opt, opt-lock, and optimistic
policies,  a  number  of  experiments  were  conducted  as  a  demonstration.  The  experiments 
were  as  follows. 

1.  The  shared  Cm  memory  system  was  used,  with  eight  transaction  processors. 

2.  500  queries,  insertions,  or  deletions  were  performed. 

3.  Transaction  processor  activities  waited  a  randomly  generated  amount  of  time 
between  transactions  and  queries,  from  0  to  2  seconds. 

4. The probability of a query was 1/2, and insertion and deletion probabilities were
each 1/4.

5. Three different experiments with three different initial databases were conducted
for  each  policy  by  varying  the  initial  seed  for  the  random  number  generator. 
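
As a rough illustration of the workload described in items 2 to 5, the following Python sketch generates one seeded operation stream. It is not the original experiment driver (which was written in BLISS): the function name `make_workload`, its parameters, and the choice to draw the waiting time from a uniform distribution are assumptions made for illustration. Whether the 500 operations were counted per processor or in total is not stated in the text, so the sketch simply produces one stream of 500.

```python
import random
from typing import List, Tuple


def make_workload(seed: int, n_ops: int = 500,
                  max_wait: float = 2.0) -> List[Tuple[str, float]]:
    """Return (operation, wait_seconds) pairs for one seeded experiment.

    Queries are drawn with probability 1/2, insertions and deletions with
    probability 1/4 each; the wait before each operation is assumed to be
    uniform on [0, max_wait] seconds.
    """
    rng = random.Random(seed)  # varying the seed gives a different experiment
    ops = rng.choices(["query", "insert", "delete"], weights=[2, 1, 1], k=n_ops)
    return [(op, rng.uniform(0.0, max_wait)) for op in ops]


# Three experiments per policy, distinguished only by the initial seed.
experiments = [make_workload(seed) for seed in (1, 2, 3)]
```

Fixing the seed makes each run repeatable, which is what allows the same three workloads to be replayed under each policy.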

The  results  of  these  experiments  are  shown  in  Table  A. 

Since  for  the  Cm*  system  the  effect  of  waiting  due  to  scheduling  is  negligible,  the  locking 
policy  uniformly  gave  the  best  throughput  as  expected.  However,  in  the  first  set  of 
experiments,  the  difference  in  transaction  conflicts  between  the  locking  policy  and  the  other 
policies  was  less  than  in  the  latter  two  sets  of  experiments.  The  result  was  that  in  this  first 
set  of  experiments,  both  the  optimistic  and  lock-opt  policies  gave  better  average  response 
times  than  the  locking  policy,  with  the  lock-opt  policy  giving  the  best  response  time  -  the 
increased restarts were not enough to cancel out the decreased response time resulting
from less scheduling. In the latter experiments the difference in restarts was greater, and
since response time includes the time taken by unsuccessful transactions, the locking policy
gave  the  best  response. 

A Averaged over all completed transactions or queries.

B Shared pages read, shared pages written.

C FTS = first time success (including queries), FTF = first time failure, RS = restart success, RF = restart
failure.

D Times in seconds.

E Completed transactions or queries/second.

F Did not occur.
8. Conclusion

Now that the four-level design framework, concurrency control paradigm, global memory
manager designs, and Cm* experiments have been presented, various issues concerning this
approach to transaction processing system design can be considered in more detail. In this
concluding chapter, several of these issues are first considered in turn. Next, conclusions
drawn  from  the  implementation  experience  are  given,  and  implications  of  the  Cm* 
experiments  are  discussed.  Finally,  some  directions  for  future  research  are  identified. 

8.1.  On  the  Use  of  Physical  Pointers 

Currently,  it  is  common  practice  to  use  physical  pointers  to  data  objects  in  database  record 
managers. The efficiency advantage is that, given a pointer to an object, the object can be
retrieved without first looking up the physical address corresponding to the pointer.
However, this advantage disappears as soon as a memory hierarchy is introduced. Using a
memory hierarchy, an object can possibly be stored at several locations, and so some form
of lookup is necessary in any case. It seems clear that memory hierarchies are necessary in
all  but  centralized  unshared  database  applications. 

8.2.  On  Multi-Version  Objects 

Once it has been decided that objects will be referred to virtually, and that ID -> address
mappings will be maintained, there are clear advantages to extending these mappings to
support multi-version objects. First, as observed in the introduction, it is common in
transaction processing systems to have large queries. Without multi-version objects, queries
can observe consistent database states only with concurrency control support. It is clearly
undesirable to abort large queries, and so some policy giving priority to large queries must
be used. Alternatively, using a hierarchical locking concurrency control, the query could
begin by read-locking large portions of the database. In either case the result is that a
large portion of the database cannot be modified by transactions while the large query is in
progress. With multi-version objects, using one of the schemes for query support presented
earlier, queries do not affect concurrent transactions (except perhaps by freezing
garbage-collection - see below). Some other advantages to the use of multi-version objects are that
in distributed systems version numbers can be used to determine when copies are
"out-of-date" (as in the Cm* system), and that versions (together with a write-phase-completion list,
counter, or similar information) form a basis for recovery at the memory management level.

An objection that has been raised to multi-version object schemes is that extra storage is
necessary. If old versions are quickly garbage-collected, this does not in general present a
significant problem, and if version numbers are managed by conceptually centralized entities,
such garbage-collection is possible: a garbage-collection algorithm was designed and
implemented in which old versions are freed at the first point at which it can be guaranteed
that no future transaction or query will access that version. It is still possible, though, for
garbage collection to be frozen for a long period of time due to the execution of a large
query. However, in a single-version scheme many transactions would be postponed until
the completion of the large query, whereas in a multi-version scheme, all transactions can
continue to run until storage is exhausted. Also, it is possible to transfer old versions to
tertiary memory, as noted earlier in Section 3.4 - this would allow the large query to
continue to run (but at a much slower rate), and would free secondary memory space.

8.3. On General Concurrency Controls

Having removed query support as a concurrency control function, the problem of designing
an efficient general-purpose concurrency control is greatly simplified. Nevertheless the
question remains whether building record manager dependence into the concurrency control
can greatly improve efficiency. In terms of run-time efficiency, this question can probably
never be answered in a final way, since any modification of a transaction processing
system's access structures could in principle introduce a variety of new specialized
concurrency  controls,  all  of  which  would  have  to  be  compared  to  many  general  concurrency 
controls.  Furthermore,  even  given  a  demonstrably  good  specialized  concurrency  control  for 
some  access  structure,  there  are  currently  no  techniques  for  generalizing  the  concurrency 
control  to  the  case  in  which  the  access  structure  is  combined  with  other  structures.  For 
example,  none  of  the  special  locking  protocols  developed  for  B-trees  (see  [Samadi  79], 
[Bayer and Schkolnick 77], [Miller and Snyder 78], [Ellis 80], or [Lehman and Yao 81]) can be
applied directly to the record manager used in the Cm* system - although this record
manager uses B-tree indexes, there is no global tree structure, since the index records of
several B-trees can all point to the same tuple. So regardless of the run-time efficiency of
specialized  concurrency  controls,  there  are  clearly  development  and  maintenance 
advantages  for  general  concurrency  controls. 

Based  on  the  Cm*  experiments,  it  seems  that  general  concurrency  controls  can  provide 
enough  concurrency  to  effectively  utilize  parallelism,  giving  significant  increases  in 
throughput.  It  is  important  to  realize,  though,  that  these  results  depend  on  the  fact  that  a 
record manager was used in which conceptually small transactions were usually physically
small as well, and in which conceptually non-conflicting transactions were usually physically
non-conflicting. Since such properties are highly desirable for record managers in any case,
one cannot seriously object to the fact that the efficiency of a general concurrency control
depends on these properties. However, because the use of record managers with these
properties is so important in the four-level architecture, earlier work on the problem of
designing general index structures for such record managers is reported in Appendix I.

8.4.  Implementation  Experience 

Two conclusions can be drawn from the Cm* transaction processing system implementation
experience. First, the four-level framework really is valuable in the development of
transaction processing systems. In the case of the Cm* system, the record manager was
earlier developed on the DEC PDP-10 architecture under an operating system (TOPS-10)
completely different from Medusa, using a different concurrency control and a different
memory manager (the concurrency control was an early implementation of the optimistic
method, and the memory manager supported only single-version objects). Luckily, dialects
of a common language (BLISS - see [Wulf et al 71]) were available on both systems, but this
should often be the case for high-level languages. The only modifications necessary to the
original record manager, other than those due to differences in the language dialects, were
due to a lack of information-hiding in the original design - the original record manager
managed its own local page memory as a stack of pages and a few temporary pages. By
replacing these pages with pointers to pages and by moving local memory management to
the LMM, a more general design resulted. In fact, the modified record manager could be
used on both systems, and was actually debugged and tested on the PDP-10 system. Since
the record manager was by far the most complex subsystem, this speeded development time
immensely.

Second, if a concurrency control module providing all functions needed by the paradigm is
implemented first, it is then easy to implement any particular general concurrency control as
a policy module. In the Cm* system, the policy module providing all basic policies with
deadlock detection and request queueing consisted of 72 lines of (BLISS) code. The module
providing all functions needed by the paradigm can be thought of as a kernel in the HYDRA
sense: the HYDRA operating system kernel was defined to be a set of facilities "which are
both necessary and adequate for the construction of a large and interesting class of
operating environments" [Wulf et al 74]. By replacing "a large and interesting class of
operating environments" with "all general concurrency controls", a general concurrency
control kernel is defined.

8.5.  Implications  of  the  Cm*  Experiments 

One  conclusion  that  might  be  drawn  from  the  Cm*  experiments  is  that,  since  the  effect  of 
waiting  due  to  scheduling  was  negligible,  future  investigations  of  concurrency  controls 
should concentrate on locking-style policies. This conclusion is tentative at best: Cm* is
currently a unique system, partly multiprocessor, partly computer network, and it is not at all
clear that this result applies to dissimilar systems. Also, in these experiments individual
processors were not multiprogrammed, but the expense of scheduling could be drastically
increased if scheduling required context swaps. However, using a concurrency control
policy module it is easy to perform initial experiments on any system to determine if this
same situation holds, and so questions regarding the general applicability of this result seem
somewhat unimportant. For example, it might be the case for some system that the effect of
restarts was negligible (due to extremely high cache-hit ratios on restarts, say), but that
waiting was expensive (due to context swaps, say) - after having determined this, policy
developers and maintainers for this system would simply concentrate on policies that avoid
scheduling.

A more far-reaching conclusion can be drawn from the fact that the performance differences
among the various policies tested were nearly insignificant when compared to the performance
differences using the two types of shared memory. The implication is that long-term
research in transaction processing system design should concentrate more on fundamental
systems problems, such as increasing memory and communication bandwidths, than on
concurrency control problems. A concurrency control policy module, on the other hand, can
be seen as a maintenance and "tuning" tool that is most useful after a transaction
processing system has been developed.


8.6.  Further  Research 

Several areas for further research can be identified. First, here the case in which the
concurrency control has near-minimal information about transactions has been examined:
the concurrency control is informed only of the ID of each object as it is accessed, and
whether it is a read or write access. It has been argued that this approach works well in
most transaction processing systems (given good record manager design and
concurrency-control-less queries), and that this approach has the advantage of separating concurrency
control design from database design. Nevertheless there are many systems in which this
approach is unacceptable. For example, some databases used for artificial intelligence
applications  consist  of  numerous  highly  interconnected  objects,  and  currently  it  does  not 
seem  possible  in  these  systems  to  maintain  global  consistency  with  small  independent 
transactions.  Also,  in  network  database  systems,  it  may  be  desirable  to  transfer  function 
requests  among  nodes  (e.g.,  "insert  tuple  T  in  relation  R")  instead  of  data.  Although  an 
access-driven  concurrency  control  could  be  used  at  each  node,  a  function-driven 
concurrency  control  could  prove  necessary  at  the  global  level.  A  problem  for  future 
research  then,  is  the  generalization  of  the  policy  approach  to  those  cases  in  which  additional 
information  is  available  to  the  concurrency  control. 


Another  issue  that  has  not  been  explored  here  is  the  manner  in  which  copies  of  objects  are 
handled in distributed systems. The concurrency control design that has been developed
here applies directly to the case in which an object and all of its copies are identified as a
single object, and it also applies to the case in which each copy is considered to be a
distinct object. In the latter case, though, the concurrency control can take advantage of the
knowledge of which objects are copies. This approach has been important in the
development of robust concurrency controls (e.g., the voting algorithms of [Thomas 79]).
This  can  be  seen  as  another  example  of  a  case  in  which  it  is  desirable  to  make  additional 
information  available  to  the  concurrency  control. 


Next,  although  the  policy  module  approach  can  greatly  reduce  the  need  for  performance 
analysis of concurrency controls, it certainly does not eliminate it. For example, in order to
automatically switch to the optimal concurrency control method based on performance
monitoring or usage statistics, a deeper understanding of the performance characteristics of
alternative  concurrency  control  methods  is  necessary. 
Finally, a concurrency control can be used to maintain consistency while a transaction
processing system is in operation, but a recovery subsystem is necessary to restore an
inconsistent database to a consistent state. Thus, the concurrency control and recovery
subsystem are in this sense equally important. Here, concurrency and recovery problems
have been almost completely isolated through the read/write phase mechanism: in the read
phase, no recovery support is necessary, since no shared object is modified; in the write
phase, no shared object is read, and so no concurrency control interaction is necessary,
with the exception of informing the concurrency control when the new versions of objects
written in the write phase can be accessed by other transactions. It is in exactly this respect
that concurrency control and recovery interact. For example, it may be desirable for
recovery reasons not to make new versions of objects accessible by other transactions until
they have been written to duplexed disks, say. A solution is to assign the recovery
subsystem responsibility for informing the concurrency control when new versions of objects
are accessible. In any case, alternatives in communication between recovery subsystems
and concurrency controls, and application of the policy approach to recovery subsystem
design, are problems for future research.
Appendix I. Design of Record Managers

As noted in Chapter 1, in practice, most transactions are conceptually small. If the database
is organized as a collection of pages and the record manager can be designed so that a
conceptually small transaction is usually physically small as well (i.e., accesses a small
number of pages), then concurrency control at the granularity of pages will be appropriate.

One problem in designing the database so that conceptually small transactions are usually
physically small is that various search structures may be needed in order to support efficient
queries, and these search structures will have to be periodically updated as transactions
modify the database. In order to see how a record manager can be designed so that search
structures are kept up to date while still keeping transactions physically small, in this
appendix an outline of a record manager will be presented as an example. This record
manager does not presuppose any particular data model; in practice, any given data model
would be supported at a higher record management level. Note that this is just an example,
and that many details have been omitted; there are many alternatives in record manager
design, as presented in numerous textbooks (e.g., see [Wiederhold 77], [Date 77], or [Ullman
80]) and elsewhere; furthermore, this is a problem of on-going research.

Although  there  are  many  data  models  that  can  be  used  at  the  conceptual  level  (the  most 
popular being relational, hierarchical, network, and entity-relationship), all of these data
models can be realized as collections of files of records. A record of type (type_0, type_1, ...,
type_{N-1}) is an element of domain_0 x domain_1 x ... x domain_{N-1}, where each type_i is some
primitive type (e.g., integer, string, etc.), and each domain_i is the set of all values of type
type_i. A file is a set of records all of the same type. The various data models result from
decisions on whether or not pointers to records or files are allowed as primitive types, and if
they  are  allowed,  restrictions  on  their  use. 

It  is  useful  for  a  variety  of  reasons  to  have  a  means  of  referring  to  existing  records  without 
referring to their locations (at the virtual internal level), for use as record pointers, for
example. Therefore, let each record have a unique record ID. These can be generated by
the Mname facility of the GMM. The advantage of not referring explicitly to a record's
location (in terms of keeping transactions small) is that records may be moved for storage
allocation  purposes  without  requiring  a  large  number  of  pointer  modifications. 

The basic operations defined on a file are inserting a new record, deleting an old record
given its ID, and retrieving a record given its ID (assume for simplicity that a record update is
handled by a deletion of the record to be updated followed by an insertion of the updated
record, however keeping the old record ID). The problems now are (1) to find space in
some page for inserting a new record, (2) to reclaim the unused space after a record has
been deleted, and (3) to find a record given its ID. One structure that solves all of these
problems nicely is the B-tree [Bayer and McCreight 72], of which there are many variants
(see [Comer 79] for a survey). Assuming that several records can be stored per page, it can
be used in this case by (1) maintaining an index of record locations on record IDs, and
(2) using the B-tree storage allocation scheme for records, that is, treating pages containing
records as the leaf pages of the B-tree structure. With respect to (2), this means that
storage will be utilized effectively by sometimes redistributing records among adjacent pages
(adjacent in record ID space), creating or deleting pages as necessary. In the case that
several records cannot be stored per page, groups of pages can be linked together, and the
group treated as one large page, simulating the simpler case.

The  B-tree  structure  has  the  properties  that  storage  is  utilized  effectively  (there  is  a 
guaranteed  minimum  utilization  of  50%,  66%,  or  more,  depending  on  the  variant,  with 
typically much higher utilization), that few pages are read in finding a record given its ID
(typically, 3 or 4, depending on page size and number of records), and that usually only one
page  is  modified  for  an  insertion  or  a  deletion  (in  the  case  that  several  records  can  be  stored 
per  page). 
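
To make the "3 or 4 pages" figure concrete, here is a small Python calculation of the number of levels in a B-tree-like index. The page capacities used (20 records per leaf page, 100 entries per index page) are hypothetical values chosen for illustration, not figures from the Cm* system, and the function assumes fully packed pages, so a real tree with partially filled pages may be one level deeper.

```python
def pages_read_per_lookup(num_records: int, records_per_leaf: int,
                          index_fanout: int) -> int:
    """Levels from root to leaf, i.e. pages read to find one record by ID.

    Assumes every page is full; both capacities are illustrative parameters.
    """
    if records_per_leaf < 1 or index_fanout < 2:
        raise ValueError("need records_per_leaf >= 1 and index_fanout >= 2")

    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)

    nodes = ceil_div(num_records, records_per_leaf)  # number of leaf pages
    levels = 1
    while nodes > 1:  # add index levels until a single root page remains
        nodes = ceil_div(nodes, index_fanout)
        levels += 1
    return levels


small = pages_read_per_lookup(10_000, records_per_leaf=20, index_fanout=100)
large = pages_read_per_lookup(1_000_000, records_per_leaf=20, index_fanout=100)
```

Under these assumed capacities, `small` is 3 and `large` is 4: a hundredfold increase in the number of records adds only one page read per lookup.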

Note  on  the  generation  of  record  IDs:  in  the  case  that  there  are  many  concurrent  insertions, 
it is not desirable that newly generated record IDs be close in record ID space, since
otherwise there would be many conflicts. If record IDs are generated by using the current
time or by incrementing a counter, this could be a problem, and a scheme to handle it is to
index the record with ID ID under F(ID), where F is some "randomizing" function, and then
later find the record by searching for F(ID). A simple example of an F is

F(a_{n-1} 2^{n-1} + a_{n-2} 2^{n-2} + ... + a_0) = a_0 2^{n-1} + a_1 2^{n-2} + ... + a_{n-1},

that is, reverse the binary digits of the record ID.
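
The digit-reversing function can be sketched in a few lines of Python. The name `reverse_bits` and the fixed width `n_bits` are assumptions for illustration; the text does not say how wide record IDs are.

```python
def reverse_bits(record_id: int, n_bits: int = 32) -> int:
    """Return record_id with its n_bits-wide binary representation reversed."""
    if not 0 <= record_id < (1 << n_bits):
        raise ValueError("record_id does not fit in n_bits")
    result = 0
    for _ in range(n_bits):
        # Move the lowest remaining bit of record_id onto the low end of result.
        result = (result << 1) | (record_id & 1)
        record_id >>= 1
    return result


# IDs produced by incrementing a counter are adjacent in ID space ...
counter_ids = [0, 1, 2, 3]
# ... but their reversed images are spread across the 8-bit key space.
index_keys = [reverse_bits(i, n_bits=8) for i in counter_ids]  # [0, 128, 64, 192]
```

Because reversal is its own inverse for a fixed width, the same function maps an index key back to the record ID, and concurrent insertions with consecutive IDs tend to land on different index pages.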

The  above  structure  is  all  that  is  needed  in  many  database  applications.  These  include 
some network and hierarchical applications in which all records are found by starting from a
root record and then following pointers. However, if a pointer to a record is not available,
finding a record given its ID is not a particularly useful operation. In such applications (for
example, relational databases), secondary indexes may be needed for query support.

For the record manager described below, secondary indexes of the following kind will be
supported. 

1. There are index records of the form

key_0, key_1, ..., key_{K-1}, record ID,

where key_i is an element of a finite totally ordered set domain_i, K is a
constant, and record ID refers to a record (in the file) with these values as
some of its fields.

2. It is desired to retrieve records based on queries of the form

min_i ≤ key_i ≤ max_i, 0 ≤ i ≤ K-1,

i.e., range queries.

The problem of designing a search structure for this type of secondary index that has some
of the same properties as B-trees has been investigated in [Robinson 81]. The resulting
structure was named the K-D-B-tree since it combines properties of K-D-trees [Bentley 75]
and B-trees. A survey of data structures for range searching appears in [Bentley and
Friedman 79]. The K-D-B-tree structure will now be presented.

Define a point to be an element of domain_0 x domain_1 x ... x domain_{K-1}, and a region to
be the set of all points (x_0, x_1, ..., x_{K-1}) satisfying

min_i ≤ x_i < max_i, 0 ≤ i ≤ K-1,

for some collection of min_i, max_i ∈ domain_i. Points can be represented most simply by
storing the x_i, and regions by storing the min_i and max_i.

Below, it will be required that certain regions be disjoint, and that their union be a region -
thus the strict inequality on the right hand side of the region definition above. However, it
will also be required that the union of certain regions be all of domain_0 x domain_1 x ... x
domain_{K-1}. It is therefore necessary to create for each domain a special element ∞_i which
is greater than all elements of domain_i, and to allow the max_i to assume these values. It
is also convenient to define -∞_i as the minimum of domain_i.
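
A minimal Python sketch of this representation follows, assuming numeric domains whose minimum is 0 and using `math.inf` in the role of the special greatest element. The names `contains` and `overlaps` are illustrative and do not come from the original system.

```python
import math
from typing import Sequence, Tuple

INF = math.inf  # stands in for the special element greater than every domain value

Interval = Tuple[float, float]   # half-open [min_i, max_i)
Region = Tuple[Interval, ...]    # one interval per dimension


def contains(region: Region, point: Sequence[float]) -> bool:
    """True if min_i <= x_i < max_i holds in every dimension."""
    return all(lo <= x < hi for (lo, hi), x in zip(region, point))


def overlaps(a: Region, b: Region) -> bool:
    """True if the two regions share at least one point."""
    return all(a_lo < b_hi and b_lo < a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))


# Two regions that meet at x_0 = 5: disjoint, yet together they cover the
# whole (assumed non-negative) two-dimensional domain.
left = ((0, 5), (0, INF))
right = ((5, INF), (0, INF))
boundary_point = (5, 3)
```

The strict upper bound places `boundary_point` in `right` only, so the two regions are disjoint, while the `INF` upper bounds let their union cover arbitrarily large coordinates.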

Like B-trees, K-D-B-trees consist of a collection of pages and a variable root ID that gives
the  page  ID  of  the  root  page.  There  are  two  types  of  pages  in  a  K-D-B-tree. 

1. Region pages: region pages contain a collection of (region, page ID) pairs.

2. Point pages: point pages contain a collection of (point, record ID) pairs,
where record ID refers to a database record. The (point, record ID)
pair is in fact an index record.

The following set of properties defines the K-D-B-tree structure. The algorithm for range
queries given below depends only on these properties, and the algorithms for insertions and
deletions are designed so as to preserve these properties.

1. Considering each page as a node and each page ID in a region page as a
node pointer, the resulting graph structure is a multi-way tree with root root
ID. Furthermore, no region page contains a null pointer, and no region
page is empty (note that this, together with the fact that point pages do not
contain page IDs, means that the point pages are the leaf nodes of the
tree).

2. The path length, in pages, from root page to leaf page is the same for all
leaf  pages. 


3. In every region page, the regions in the page are disjoint, and their union is
a region.

4. If the root page is a region page (it may not exist, or if there is only one
page in the tree it will be a point page), the union of its regions is
domain_0 x domain_1 x ... x domain_{K-1}.

5. If (region, child ID) occurs in a region page, and the child page referred to
by child ID is a region page, then the union of the regions in the child page
is region.

6. Referring to (5), if the child page is a point page, then all the points in the
page must be in region.

Figure  A1  illustrates  an  example  2-D-B-tree. 

A range query can be expressed by specifying a region, the query region. It is convenient to
think of regions as a cross-product of intervals I_0 x I_1 x ... x I_{K-1}. If some of the intervals of
a query region are full domains, the query is a partial range query; if some of the intervals
are points and the rest are full domains, the query is a partial match query; if all of the
intervals are points, the query is an exact match query.

The algorithm to output all records satisfying a range query specified by query region is as
follows. 


Q1. If root ID is the null page ID, terminate. Otherwise, let page be the root
page.

Q2. If page is a point page, then for each (point, record ID) pair in
page with point a member of query region, retrieve and output the
database record with ID record ID.

Q3. Otherwise, for each (region, child ID) pair in page such that the
intersection of region and query region is non-empty, set page to be
the page referred to by child ID, and recurse from (Q2).
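
Steps Q1 to Q3 can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not the original BLISS code: pages live in a dictionary keyed by page ID, regions are tuples of half-open (min, max) intervals, and the function returns record IDs instead of retrieving the database records themselves. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

INF = float("inf")
Point = Tuple[float, ...]
Region = Tuple[Tuple[float, float], ...]  # half-open [min_i, max_i) per dimension


@dataclass
class PointPage:
    entries: List[Tuple[Point, int]] = field(default_factory=list)   # (point, record ID)


@dataclass
class RegionPage:
    entries: List[Tuple[Region, int]] = field(default_factory=list)  # (region, child ID)


Page = Union[PointPage, RegionPage]


def contains(region: Region, point: Point) -> bool:
    return all(lo <= x < hi for (lo, hi), x in zip(region, point))


def overlaps(a: Region, b: Region) -> bool:
    return all(a_lo < b_hi and b_lo < a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))


def range_query(pages: Dict[int, Page], root_id: Optional[int],
                query: Region) -> List[int]:
    if root_id is None:                      # Q1: empty tree
        return []
    page = pages[root_id]
    if isinstance(page, PointPage):          # Q2: report matching index records
        return [rid for point, rid in page.entries if contains(query, point)]
    found: List[int] = []
    for region, child_id in page.entries:    # Q3: descend into overlapping regions
        if overlaps(region, query):
            found.extend(range_query(pages, child_id, query))
    return found


# A two-level example tree: one region page over two point pages.
tree: Dict[int, Page] = {
    0: RegionPage([(((0, 5), (0, INF)), 1), (((5, INF), (0, INF)), 2)]),
    1: PointPage([((1, 1), 101), ((4, 7), 102)]),
    2: PointPage([((5, 2), 103), ((9, 9), 104)]),
}
hits = range_query(tree, 0, ((3, 6), (0, 8)))
```

Here the query region overlaps both child regions, so both point pages are visited, and `hits` contains the record IDs 102 and 103.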

Next, for insertions, it is necessary to define the splitting of a region along element x_i of
domain_i. Let the region be I_0 x I_1 x ... x I_{K-1}. If x_i ∉ I_i, the region is not changed by
splitting. Otherwise, let I_i = [min_i, max_i); splitting the region results in two new regions:

1. I_0 x ... x [min_i, x_i) x ... x I_{K-1},

2. I_0 x ... x [x_i, max_i) x ... x I_{K-1}.

Region (1) is called the left region and region (2) the right region. If x_i ∉ I_i since x_i ≥ max_i,
the region is said to lie to the left of x_i. If x_i < min_i, the region is said to lie to the right of
x_i. A point (y_0, y_1, ..., y_{K-1}) is said to lie to the left of x_i if y_i < x_i, and to the right
of x_i otherwise.
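
Region splitting can be sketched in Python as below, with regions again represented as tuples of half-open (min, max) intervals. The function name `split_region` and its convention of returning `None` for a missing side are assumptions. One choice in the sketch goes slightly beyond the text: when x equals min_i the left piece would be empty, so the region is treated as lying to the right of x.

```python
from typing import Optional, Tuple

INF = float("inf")
Region = Tuple[Tuple[float, float], ...]  # half-open [min_i, max_i) per dimension


def split_region(region: Region, i: int,
                 x: float) -> Tuple[Optional[Region], Optional[Region]]:
    """Split region along x in dimension i; return (left region, right region).

    If the region lies wholly to one side of x it is returned unchanged on
    that side, and the other side is None.
    """
    lo, hi = region[i]
    if x >= hi:           # every point has coordinate i < x: region lies to the left
        return region, None
    if x <= lo:           # every point has coordinate i >= x: region lies to the right
        return None, region
    left = region[:i] + ((lo, x),) + region[i + 1:]
    right = region[:i] + ((x, hi),) + region[i + 1:]
    return left, right


whole = ((0, 10), (0, INF))
left_part, right_part = split_region(whole, 0, 4)
```

After the call, `left_part` is `((0, 4), (0, INF))` and `right_part` is `((4, 10), (0, INF))`: the two pieces are disjoint and their union is the original region.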

A point page is split along x_i by creating two new point pages, the left page and the right
page; then transferring all the (point, record ID) pairs in the page to either the left or right
page depending on whether point lies to the left or the right of x_i, and then deleting the old page.


A region page is split along x_i in a similar way, by creating two new region pages, the left
page and the right page, and then handling each (region, page ID) pair in the page as follows.

S1. If region lies to the left of x_i, add (region, page ID) to the left
page.

S2. If region lies to the right of x_i, add (region, page ID) to the right
page.

S3. Otherwise:

S3.1. Split the page referenced by page ID along x_i, resulting in pages with IDs
left ID and right ID.

S3.2. Split region along x_i, resulting in regions left region and right region.

S3.3. Add (left region, left ID) to the left page, and (right region, right ID) to the
right page.

Note  that  this  procedure  is  recursive  due  to  (S3.1). 
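The recursion can be made concrete with a short Python sketch. It is a sketch only: the names PointPage, RegionPage, split_interval, and split_page are hypothetical, child pages are held directly instead of by page ID, and regions are tuples of half-open intervals (lo, hi).

```python
from dataclasses import dataclass, field

@dataclass
class PointPage:
    entries: list = field(default_factory=list)   # (point, record_id) pairs

@dataclass
class RegionPage:
    entries: list = field(default_factory=list)   # (region, child_page) pairs

def split_interval(region, i, x):
    """Split a region whose i-th interval contains x into left and right regions."""
    lo, hi = region[i]
    return (region[:i] + ((lo, x),) + region[i + 1:],
            region[:i] + ((x, hi),) + region[i + 1:])

def split_page(page, i, x):
    """Split a page along element x of domain i; returns (left page, right page)."""
    if isinstance(page, PointPage):
        left, right = PointPage(), PointPage()
        for point, record_id in page.entries:
            target = left if point[i] < x else right
            target.entries.append((point, record_id))
        return left, right
    left, right = RegionPage(), RegionPage()
    for region, child in page.entries:
        lo, hi = region[i]
        if hi <= x:                                   # S1: region lies to the left
            left.entries.append((region, child))
        elif x <= lo:                                 # S2: region lies to the right
            right.entries.append((region, child))
        else:                                         # S3: region straddles x
            child_left, child_right = split_page(child, i, x)           # S3.1
            region_left, region_right = split_interval(region, i, x)    # S3.2
            left.entries.append((region_left, child_left))              # S3.3
            right.entries.append((region_right, child_right))
    return left, right

lower = PointPage([((2, 1), "a"), ((7, 3), "b")])
upper = PointPage([((3, 8), "c")])
root = RegionPage([(((0, 10), (0, 5)), lower), (((0, 10), (5, 10)), upper)])
left_page, right_page = split_page(root, 0, 5)
```

In this example both child regions straddle the splitting element, so both children are split even though neither is overfull, and one of the resulting point pages is empty.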

The  algorithm  for  inserting  an  index  record  (point,  record  ID)  is  as  follows. 

I1. If root ID is null, create a point page containing (point, record ID), set
root ID to the ID of this page, and terminate.

I2. Otherwise, do an exact match query on point, which finds a point page
that point should be added to if the K-D-B-tree structure is to be
preserved. If point is already in the page, do something special (like
generating an error, or modifying pointer fields to existing database
records), and terminate.

I3. Add (point, record ID) to the point page. If the page does not overflow,
terminate. Otherwise, let page be the point page.

I4. Let the ID of page be old ID. Pick a domain, domain_i, and an element
x_i in this domain, such that page split along x_i will result in two pages
that are not overfull (since the number of points or regions in page
need only be decreased by one to avoid overflow, it is easy to see that
this is always possible). Split page along x_i, giving left and right pages
with IDs left ID and right ID.

I5. If page was the root page, go to (I6). Otherwise, let page be the
parent page of page (this parent page was found during the exact
match query step above). Replace (region, old ID) in page
with (left region, left ID) and (right region, right ID), where left region
and right region are obtained by splitting region along x_i. If this causes
page to overflow, repeat from (I4); otherwise terminate.

I6. Create a new region page containing the regions

(domain_0 × ... × [-∞_i, x_i) × ... × domain_{K-1}, left ID),

(domain_0 × ... × [x_i, ∞_i) × ... × domain_{K-1}, right ID),

and set root ID to its ID.

Variations of the above algorithm result from the way domain_i and x_i are chosen in (I4). One
way of choosing domain_i is to do so cyclically, as follows. Store in each page a variable
splitting domain, initialized to 0 in a root page when a new root page is created. When a
page splits, an element of domain_{splitting domain} is used, and the new pages have splitting
domain set to (splitting domain + 1) mod K. This method is analogous to the cyclic choice
of domains in K-D-trees (see [Bentley 75]). Exceptions to this procedure, as well as other
techniques for choosing domain_i and x_i, are discussed in [Robinson 81].
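A minimal Python sketch of the cyclic choice follows. The function name choose_split is hypothetical, and picking x_i as the median coordinate of the points in the page is an assumption made purely for illustration; the text leaves the choice of x_i open, and a median does not guarantee a usable split when many points share the same coordinate.

```python
def choose_split(points, splitting_domain, k):
    """Return (domain index i, element x_i, splitting domain for the new pages)."""
    i = splitting_domain
    coords = sorted(p[i] for p in points)
    x = coords[len(coords) // 2]              # assumed choice: the median coordinate
    return i, x, (splitting_domain + 1) % k   # new pages advance cyclically mod K

points = [(1, 9), (5, 2), (3, 7), (8, 4)]
first = choose_split(points, 0, 2)    # split on domain 0, new pages use domain 1
second = choose_split(points, 1, 2)   # split on domain 1, new pages wrap to domain 0
```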

Since the K-D-B-tree structure does not preclude empty point pages, and has no minimum
storage utilization requirements, the basic deletion algorithm is very simple: find the index
record (point, record ID) with an exact match query, and remove (point, record ID) from the
point page.

Unless there are very few deletions, or by chance insertions take place that "fill in the holes"
left by deletions, this basic deletion algorithm will be unacceptable due to the resulting low
storage utilization. In B-tree algorithms, this problem is solved by what are here considered
to be reorganization techniques. This reorganization takes place by redistributing index
records among two or more adjacent sibling pages. The same type of reorganization can be
performed on K-D-B-trees, provided the notion of adjacency can be generalized to more
than one dimension.

One way to generalize adjacency is as follows: if the union of two or more regions is a
region, the regions are said to be joinable. Using this property, an outline of the algorithm to
"reorganize page P" is as follows (P could be an underfull point page produced by a
deletion, or an underfull region page produced by previous reorganization).

1. Let page be the parent page of P, containing (region, ID), where ID refers
to P.

2. Find (region_1, ID_1), (region_2, ID_2), ..., in page such that region, region_1,
region_2, ..., are joinable (this is always possible; in the worst case, this will
be all the regions of page).

3. Catenate the pages with IDs ID, ID_1, ID_2, ..., and then repeatedly split this
page and resulting pages until no page is overfull.

4. Replace (region, ID), (region_1, ID_1), (region_2, ID_2), ..., in page with the
resulting new regions and page IDs.

5. If page is the root page, and it now contains only one pair (region, ID),
delete page and set root ID to ID.
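The joinability test used in step 2 can be sketched in Python. This is one possible check, not necessarily the one used in the original work: it assumes the regions are pairwise disjoint (as the regions of one region page are), have finite integer bounds, and are tuples of half-open intervals (lo, hi). Under those assumptions the union is itself a region exactly when the volumes of the parts add up to the volume of their bounding region. The function names are hypothetical.

```python
from math import prod

def bounding_region(regions):
    """Smallest region (product of intervals) containing all the given regions."""
    k = len(regions[0])
    return tuple((min(r[i][0] for r in regions), max(r[i][1] for r in regions))
                 for i in range(k))

def volume(region):
    return prod(hi - lo for lo, hi in region)

def joinable(regions):
    """True if the union of the (disjoint) regions is itself a region."""
    return sum(volume(r) for r in regions) == volume(bounding_region(regions))

a = ((0, 5), (0, 5))
b = ((0, 5), (5, 10))
c = ((5, 10), (0, 10))
d = ((5, 10), (5, 10))
```

Here a and b are joinable, a and d are not, and a, b, and c together are joinable because they tile the whole square [0, 10) × [0, 10).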

Another possible use of reorganization is during insertions, since step (S3) can leave empty
or near-empty point pages. This should probably be done only at the point page level, since
reorganization itself makes use of step (S3) when performed at higher levels. However,
almost all pages are point pages (see the table below). Reorganization strategies for
K-D-B-trees, and the performance of K-D-B-trees under reorganization, are subjects of current
research.

A major difference between K-D-B-trees and B-trees with respect to insertions is step (S3),
which forces pages at lower levels to split even though they are not overfull. An immediate
question is how badly step (S3) affects performance, in terms of storage utilization and page
accesses. Surprisingly, the performance of K-D-B-trees is quite good in spite of step (S3),
even without reorganization. Table B shows the insertion characteristics of 2-D-B-trees
and 3-D-B-trees, without reorganization, and with index records randomly generated,
uniformly distributed in K-space. Details of these and other experiments appear in [Robinson
81].

Including  various  secondary  indexes,  as  desired,  an  example  of  the  resulting  file  structure  is 
shown  in  Figure  A2.  A  problem  of  future  research  is  the  optimal  choice  of  multidimensional 
secondary  indexes,  given  some  kind  of  characterization  of  the  "average"  query. 


     PAGE        FILE       PAGES AT EACH       STORAGE        PAGES ACCESSED/
K    SIZES(a)    SIZE       LEVEL(b)            UTILIZATION    INSERTION(c)

2    25, 42       20,000    1, 2, 40, 714       0.66           1.09, 348
                  40,000    1, 4, 80, 1488      006            1.00, 4.00
                  60,000    1, 7, 122, 2187     006            1.13, 440
                  80,000    1, 9, 166, 2904     0.65           1.18, 4.00
                 100,000    1, 12, 200, 3882    044            1.18, 440

3    38, 63       20,000    1, 20, 514          0.61           1.16, 242
                  40,000    1, 2, 45, 1060      049            1.16, 340
                  60,000    1, 2, 63, 1547      0.61           1.16, 441
                  80,000    1, 4, 69, 2064      0.60           1.18, 441
                 100,000    1, 4, 108, 2604     0.60           1.15, 440

(a) Page sizes = R, P, where R is maximum number of regions in a
region page, P is maximum number of points in a point page.

(b) For example, "1, 20, 514" means 1 page at level 1 (root page), 20
pages at level 2, and 514 pages at level 3 (point pages).

(c) Pages accessed = W, R, where W is pages written, R is pages read,
averaged over 20,000 insertions.

Table B. Growing K-D-B-Trees

[Figure: File Descriptor Page and records]

Figure A2. Example File Structure

By introducing record types that contain pointers to the descriptor pages of existing files,
directories of files, etc., can be built, resulting in a hierarchical structure like that of the
record manager used in the Cm* system, for example.

It might be thought that the introduction of new secondary indexes would require a
reorganization of the file, that is, a large transaction, but even this can be avoided, assuming
newly generated record IDs are monotonically increasing. With this assumption, one method
for solving the problem is as follows.

1. Create the new secondary index, and index every record inserted to the file
from this point on through this index, as usual. However, mark the index
as not being up to date, so that queries will avoid its use.

2. Let min ID and max ID be the minimum and maximum record ID in the file
at the time the new secondary index is created. Start a process that
repeatedly finds the next record ID in the file, from min ID to max ID, and
that indexes each corresponding record in the new secondary index, with
each indexing operation implemented as a separate transaction.

3. When this process terminates, mark the index as being up to date.
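The three steps can be sketched in Python. This is an illustrative sketch only: the class SecondaryIndex and the functions next_record_id and backfill are hypothetical names, the file is modeled as a dictionary from integer record IDs to records, and each call to index.add stands in for what would be a separate small transaction.

```python
class SecondaryIndex:
    def __init__(self, key_of):
        self.key_of = key_of       # extracts the indexed attribute from a record
        self.entries = {}          # key -> set of record IDs
        self.up_to_date = False    # step 1: queries must not use the index yet

    def add(self, record_id, record):
        self.entries.setdefault(self.key_of(record), set()).add(record_id)

def next_record_id(file_records, after):
    """Smallest record ID in the file greater than `after`, or None."""
    later = [rid for rid in file_records if rid > after]
    return min(later) if later else None

def backfill(file_records, index, min_id, max_id):
    """Step 2: index existing records one at a time, then step 3: mark up to date."""
    rid = next_record_id(file_records, min_id - 1)
    while rid is not None and rid <= max_id:
        index.add(rid, file_records[rid])     # one small transaction per record
        rid = next_record_id(file_records, rid)
    index.up_to_date = True

file_records = {1: {"city": "A"}, 2: {"city": "B"}, 3: {"city": "A"}}
index = SecondaryIndex(lambda record: record["city"])
min_id, max_id = min(file_records), max(file_records)

# A record inserted after the index was created is indexed on the normal path.
file_records[4] = {"city": "B"}
index.add(4, file_records[4])

backfill(file_records, index, min_id, max_id)
```

Because new record IDs are assumed to be monotonically increasing, every record inserted after the index is created has an ID above max ID, so the background process and the normal insertion path never need to index the same record.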

The deletion of an existing secondary index does not cause significant concurrency
problems,  since  once  the  pointer  to  the  index  is  removed  from  the  file  descriptor  page,  all 
future  transactions  or  queries  cannot  access  the  index.  The  process  of  deleting  all  pages  of 
the  secondary  index  can  then  be  performed  with  transactions  of  a  size  chosen  without 
regard  to  conflicts. 

This concludes the outline of a possible record manager.

Appendix II. Concurrency Control Algorithms

In this appendix concurrency control algorithms are presented. In order to make these
algorithms available to a wider audience, the Cm* CC subsystem was re-programmed in
Pascal, following the original Bliss implementation fairly closely. The intent of the Pascal
program below is to clearly show the logical nature of the algorithms, avoiding distracting
complexity. Thus, sets of object IDs are declared as Pascal sets, even though the usual
bit-vector representation would in practice be unacceptable (for example, 16-bit object IDs imply
bit-vectors of length 64K bits to represent object ID sets); in practice, these sets would be
represented as linked lists or stacks of object IDs (a linked list representation was used in
the Cm* CC subsystem). Similarly, structures that are logically associatively accessed by
object ID are declared as arrays indexed by object ID (these are RSet, RPSet, WSet, and
WPSet below); in practice, only information for object IDs that were currently in use would
be stored. In the Cm* CC subsystem, for example, information for any given object ID was
accessed through a hash table, entries of which pointed to a linked list of object ID records
with identical hash values. New object ID records were created as necessary, and when all
transaction processor sets in a particular object ID record became empty, the record was
deleted.
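The storage scheme described for the Cm* CC subsystem can be approximated in a few lines of Python. This is a sketch under assumptions, not the Cm* implementation: the class name ObjectTable and its methods are hypothetical, and a Python dictionary stands in for the hash table with chained object ID records.

```python
class ObjectTable:
    """Per-object transaction sets, stored only for object IDs currently in use."""

    FIELDS = ("RSet", "RPSet", "WSet", "WPSet")

    def __init__(self):
        self._records = {}   # object ID -> {field name: set of transaction processor IDs}

    def add(self, oid, field, tp):
        # Create the object ID record on first use.
        record = self._records.setdefault(oid, {f: set() for f in self.FIELDS})
        record[field].add(tp)

    def remove(self, oid, field, tp):
        record = self._records.get(oid)
        if record is None:
            return
        record[field].discard(tp)
        # Delete the record once all of its transaction processor sets are empty.
        if not any(record.values()):
            del self._records[oid]

    def members(self, oid, field):
        record = self._records.get(oid)
        return set(record[field]) if record else set()

    def __len__(self):
        return len(self._records)

table = ObjectTable()
table.add("A", "RSet", "1")
table.add("A", "WPSet", "2")
size_in_use = len(table)
table.remove("A", "RSet", "1")
table.remove("A", "WPSet", "2")
size_after = len(table)
```

An object ID that has no record behaves exactly like one whose four sets are all empty, which is what allows the record to be deleted safely.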

In  order  to  test  the  Pascal  implementation,  terminal  I/O  was  used  in  place  of  what  had  been 
message  sending  and  receiving,  and  a  procedure  to  print  transaction  records  was  added. 
All  of  this  has  been  left  intact,  and  as  an  aid  to  understanding  the  program,  an  example  of 
the  execution  of  the  program  follows  the  program  listing. 

Note: the Pascal variant used was IBM's Pascal/VS; the only differences from
Pascal as described in [Jensen and Wirth 74] that occur are the "otherwise" construct in the
procedure that reads an input line, and the terminal I/O initialization procedures.

The program and example follow.

{-------------------- CONCURRENCY CONTROL --------------------}

program CC(input, output);

{ for illustrative purposes, transaction processors are named }
{ '1', '2', ..., '9', and objects are named 'A', 'B', ..., 'Z'. }

const
  MinTPID = '1'; MaxTPID = '9';
  MinOID = 'A'; MaxOID = 'Z';

{-------------------- TYPES --------------------}

type
  TPID = MinTPID .. MaxTPID;            { transaction processor ID }
  OID = MinOID .. MaxOID;               { object ID }
  TSet = set of TPID;                   { transaction set }
  OSet = set of OID;                    { object set }
  AccType = (none, R, RW, W, V);        { access types }
  DecType = (kill, wait, die, grant);   { decisions }
  SubOpType = (rdrs, wrtrs, all);       { sub-options }
  StatType = (active, pending,          { status }
              validated, aborted, completed);
  MsgType = (Cbegin, Cread, Cwrite,     { messages }
             Cend, Cvalid, Cabort, Cpolicy, Clook,
             Cquit);

  TRec = record                { transaction record }
    status: StatType;          { transaction status }
    access: AccType;           { type of most recent request }
    ObjID: OID;                { obj. ID for most recent access request }
    WaitCount: integer;        { number of trans.'s being waited on }
    PrecedeSet,                { set of trans.'s -> this trans. }
    VwaitSet,                  { this trans. =>V set of trans.'s }
    CwaitSet,                  { this trans. =>C set of trans.'s }
    ReferSet:                  { set of trans.'s that refer in any }
      TSet;                    { fashion to this trans. }
    ObjSet: OSet               { set of objects for which this trans. }
  end;                         { has requested access }


{-------------------- GLOBAL VARIABLES --------------------}

var
  TNC: integer;                  { transaction number counter }
  RunSet: TSet;                  { set of runnable transactions }
  CurrentTP: TPID;               { current transaction processor ID }
  trans: array [TPID] of TRec;   { transaction records }
  RSet,                          { read sets }
  RPSet,                         { read postponed sets }
  WSet,                          { write sets }
  WPSet:                         { write postponed sets }
    array [OID] of TSet;

{-------------------- PM VARIABLES --------------------}

  Roption,                       { read option }
  RWoption,                      { read/write option }
  Woption,                       { write option }
  Voption:                       { validation option }
    DecType;
  RWSubOption:                   { read/write sub-option }
    SubOpType;

{-------------------- INITIALIZATION --------------------}

procedure init;
var tp: TPID; id: OID;
begin
  TNC := 1;
  RunSet := [];
  CurrentTP := MinTPID;
  for tp := MinTPID to MaxTPID do
    with trans[tp] do begin
      status := completed;
      access := none;
      ObjID := MinOID;
      WaitCount := 0;
      PrecedeSet := [];
      VwaitSet := [];
      CwaitSet := [];
      ReferSet := [];
      ObjSet := []
    end;
  for id := MinOID to MaxOID do begin
    RSet[id] := [];
    RPSet[id] := [];
    WSet[id] := [];
    WPSet[id] := []
  end
end;


{-------------------- COMPUTE WAIT RELATION --------------------}

{ determine if trans. on tp1 is waiting on trans. on tp2 }

function WaitingOn(tp1, tp2: TPID): boolean;
label 1 {return};
var tp3: TPID;
begin
  WaitingOn := false;
  with trans[tp2] do begin
    if tp1 in (VwaitSet + CwaitSet)
    then WaitingOn := true
    else for tp3 := MinTPID to MaxTPID do
      if tp3 in (VwaitSet + CwaitSet)
      then if WaitingOn(tp1, tp3)
        then begin WaitingOn := true; goto 1 {return} end
  end;
1: {return}
end;

{-------------------- SET POLICY --------------------}

procedure PMpolicy(Rop, RWop, Wop, Vop: DecType; RWSubOp: SubOpType);
begin
  Roption := Rop; RWoption := RWop; Woption := Wop; Voption := Vop;
  RWSubOption := RWSubOp;
  if Voption = grant then Voption := kill { grant is an illegal Vop }
end;


{-------------------- POLICY MODULE --------------------}

{ The following function decides how to handle a request from }
{ transaction processor tp. ConflSet is the set of possibly }
{ conflicting transactions, and WaitSet and AbortSet will be }
{ set to the sets of transactions to wait on or abort. The }
{ result of the function is the decision for tp. }

function PMdecide(tp: TPID; ConflSet: TSet;
                  var WaitSet, AbortSet: TSet): DecType;
var
  tpi: TPID; decision: DecType;
begin
  with trans[tp] do begin
    case access of
      R:  decision := Roption;
      RW: decision := RWoption;
      W:  decision := Woption;
      V:  decision := Voption end;

    if (access = RW) and ((RWoption = kill) or (RWoption = wait))
    then case RWSubOption of
      all:   ;
      rdrs:  ConflSet := ConflSet - (WSet[ObjID] + WPSet[ObjID]);
      wrtrs: ConflSet := ConflSet * (WSet[ObjID] + WPSet[ObjID])
    end;

    if decision = wait then begin { check if deadlock would result }
      if access = W
      then { in order to allow queueing of requests }
        ConflSet := ConflSet -
                    ((VwaitSet + CwaitSet) * RPSet[ObjID]);
      for tpi := MinTPID to MaxTPID do
        if tpi in ConflSet
        then if WaitingOn(tpi, tp)
          then decision := die { default victim is requestor }
    end;

    if ConflSet = [] then decision := grant;

    WaitSet := []; AbortSet := [];
    case decision of
      grant: ;
      kill:  AbortSet := ConflSet;
      wait:  WaitSet := ConflSet;
      die:   end
  end;
  PMdecide := decision
end;


{-------------------- CC-TP COMMUNICATION --------------------}

{ the following two procedures would in practice }
{ each send a message to a transaction processor }

procedure SendResult(tp: TPID; result: boolean);
begin
  write('===> Message to TP', tp, ': ');
  if result then writeln('OK') else writeln('ABORT')
end;

procedure SendTN(tp: TPID; tn: integer);
begin writeln('===> Message to TP', tp, ': OK, TN = ', tn:1) end;


{ the following procedure would in practice }
{ read a message from the CC input pipe. }
{ Example terminal input: }
{   q       quit }
{   b1      begin transaction on TP1 }
{   r2B     request for read access from TP2 for B }
{   pGGaGK  change policy to optimistic method }

procedure GetMsg(var m: MsgType; var tp: TPID; var id: OID;
                 var Rop, RWop, Wop, Vop: DecType; var RWSubOp: SubOpType);
const MaxLnth = 6;
var line: array [1..MaxLnth] of char; i: 1..MaxLnth; bad: boolean;

  procedure SetMsg(c: char);
  begin case c of
    'b': m := Cbegin;  'r': m := Cread;  'w': m := Cwrite;
    'v': m := Cvalid;  'e': m := Cend;   'a': m := Cabort;
    'p': m := Cpolicy; 'l': m := Clook;  'q': m := Cquit;
    otherwise bad := true end end;

  procedure SetTPID(c: char);
  begin if (c >= MinTPID) and (c <= MaxTPID)
    then tp := c else bad := true end;

  procedure SetOID(c: char);
  begin if (c >= MinOID) and (c <= MaxOID)
    then id := c else bad := true end;

  procedure SetOp(c: char; var o: DecType);
  begin case c of
    'K': o := kill; 'W': o := wait; 'D': o := die; 'G': o := grant;
    otherwise bad := true end end;

  procedure SetSubOp(c: char; var so: SubOpType);
  begin case c of
    'r': so := rdrs; 'w': so := wrtrs; 'a': so := all;
    otherwise bad := true end end;

begin
  repeat
    writeln('(enter input)');
    read(line[1]);
    for i := 2 to MaxLnth do
      if eoln then line[i] := ' ' else read(line[i]);
    readln;
    bad := false;
    SetMsg(line[1]);
    if not bad then begin
      if m = Cpolicy
      then begin
        SetOp(line[2], Rop); SetOp(line[3], RWop);
        SetSubOp(line[4], RWSubOp); SetOp(line[5], Wop);
        SetOp(line[6], Vop) end
      else if m <> Cquit
      then begin
        SetTPID(line[2]);
        if (m = Cread) or (m = Cwrite) then SetOID(line[3]) end
    end;
    if bad then writeln('(bad input, try again)')
  until not bad;

  if m = Cpolicy
  then begin
    write('Policy change message: NEW POLICY = ');
    for i := 2 to MaxLnth do write(line[i]);
    writeln end
  else if (m <> Cquit) and (m <> Clook)
  then begin
    write('Message from TP', tp, ': ');
    case m of
      Cbegin: writeln('BEGIN');      Cread:  writeln('READ ', id);
      Cwrite: writeln('WRITE ', id); Cvalid: writeln('VALIDATE');
      Cend:   writeln('END');        Cabort: writeln('ABORT') end end
  else if m = Cquit
  then writeln('(exit)')
end;


{-------------------- SCHEDULING --------------------}

{ have tp1 wait on tp2 }

procedure schedule(tp1, tp2: TPID);
begin
  with trans[tp1] do begin
    if (access = R) or
       ((access = RW) and (tp2 in (WSet[ObjID] + WPSet[ObjID])))
    then
      trans[tp2].CwaitSet := trans[tp2].CwaitSet + [tp1]
    else
      trans[tp2].VwaitSet := trans[tp2].VwaitSet + [tp1];
    WaitCount := WaitCount + 1;
    ReferSet := ReferSet + [tp2]
  end
end;


{ postpone transaction }

procedure postpone(tp: TPID);
begin
  with trans[tp] do begin
    if access = V
    then status := pending
    else begin
      if (access = R) or (access = RW)
      then RPSet[ObjID] := RPSet[ObjID] + [tp];
      if (access = W) or (access = RW)
      then WPSet[ObjID] := WPSet[ObjID] + [tp];
      ObjSet := ObjSet + [ObjID]
    end
  end
end;

{-------------------- MAINTAIN PRECEDE RELATION --------------------}

procedure precede(tp1, tp2: TPID);
begin
  with trans[tp1] do begin
    if (access = R) or (access = RW)
    then if tp2 in WSet[ObjID]
      then begin
        trans[tp2].PrecedeSet := trans[tp2].PrecedeSet + [tp1];
        ReferSet := ReferSet + [tp2]
      end;
    if (access = W) or (access = RW)
    then if tp2 in RSet[ObjID]
      then begin
        PrecedeSet := PrecedeSet + [tp2];
        trans[tp2].ReferSet := trans[tp2].ReferSet + [tp1]
      end
  end
end;


{-------------------- GRANT A REQUEST --------------------}

procedure GrantReq(tp: TPID);
begin
  with trans[tp] do begin
    if (access = R) or (access = RW)
    then begin
      RPSet[ObjID] := RPSet[ObjID] - [tp];
      RSet[ObjID] := RSet[ObjID] + [tp]
    end;
    if (access = W) or (access = RW)
    then begin
      WPSet[ObjID] := WPSet[ObjID] - [tp];
      WSet[ObjID] := WSet[ObjID] + [tp]
    end;
    ObjSet := ObjSet + [ObjID]
  end;
  SendResult(tp, true)
end;

{-------------------- ABORT A TRANSACTION --------------------}

procedure abort(tp: TPID);
var tpi: TPID; id: OID;
begin
  with trans[tp] do begin
    status := aborted;
    for tpi := MinTPID to MaxTPID do
    begin
      if tpi in (VwaitSet + CwaitSet)
      then with trans[tpi] do begin
        WaitCount := WaitCount - 1;
        if WaitCount = 0 then RunSet := RunSet + [tpi]
      end;
      if tpi in ReferSet then
        with trans[tpi] do begin
          PrecedeSet := PrecedeSet - [tp];
          VwaitSet := VwaitSet - [tp];
          CwaitSet := CwaitSet - [tp]
        end
    end;
    PrecedeSet := []; VwaitSet := []; CwaitSet := []; ReferSet := [];
    WaitCount := 0;
    for id := MinOID to MaxOID do
      if id in ObjSet then
      begin
        RSet[id] := RSet[id] - [tp]; RPSet[id] := RPSet[id] - [tp];
        WSet[id] := WSet[id] - [tp]; WPSet[id] := WPSet[id] - [tp]
      end;
    ObjSet := []
  end
end;

{-------------------- VALIDATE A TRANSACTION --------------------}

procedure validate(tp: TPID);
var tpi: TPID; id: OID;
begin
  with trans[tp] do begin
    status := validated;
    for tpi := MinTPID to MaxTPID do begin
      if tpi in VwaitSet
      then with trans[tpi] do begin
        WaitCount := WaitCount - 1;
        if WaitCount = 0 then RunSet := RunSet + [tpi]
      end;
      if tpi in ReferSet
      then with trans[tpi] do begin
        PrecedeSet := PrecedeSet - [tp] end
    end;
    for id := MinOID to MaxOID do
      if id in ObjSet
      then RSet[id] := RSet[id] - [tp]
  end;
  SendTN(tp, TNC);
  TNC := TNC + 1
end;

{-------------------- COMPLETE A TRANSACTION --------------------}

procedure complete(tp: TPID);
var tpi: TPID; id: OID;
begin
  with trans[tp] do begin
    status := completed;
    for tpi := MinTPID to MaxTPID do
      if tpi in CwaitSet
      then with trans[tpi] do begin
        WaitCount := WaitCount - 1;
        if WaitCount = 0 then RunSet := RunSet + [tpi]
      end;
    PrecedeSet := []; VwaitSet := []; CwaitSet := []; ReferSet := [];
    WaitCount := 0;
    for id := MinOID to MaxOID do
      if id in ObjSet
      then WSet[id] := WSet[id] - [tp];
    ObjSet := []
  end
end;
{-------------------- PROCESS A REQUEST --------------------}

procedure process(tp: TPID; ConflSet: TSet);
var decision: DecType; WaitSet, AbortSet: TSet; tpi: TPID;
begin
  { let the policy determine how to process this request }
  decision := PMdecide(tp, ConflSet, WaitSet, AbortSet);

  { check if the policy decided to abort a validated transaction, }
  { or if the policy decided to grant the request even though there }
  { was a validated conflicting transaction }
  for tpi := MinTPID to MaxTPID do
    if trans[tpi].status = validated
    then if
      (tpi in AbortSet) or
      ( ((decision = grant) or (decision = kill)) and
        (tpi in ConflSet) )
      then begin decision := die; AbortSet := [] end;

  { abort transactions in AbortSet }
  for tpi := MinTPID to MaxTPID do
    if tpi in AbortSet then abort(tpi);

  { now process the request according to decision }
  case decision of
    die: begin abort(tp); SendResult(tp, false) end;
    wait: begin
      for tpi := MinTPID to MaxTPID do
        if tpi in WaitSet then schedule(tp, tpi);
      postpone(tp)
    end;
    grant,
    kill: if trans[tp].access = V
      then validate(tp)
      else
      begin
        for tpi := MinTPID to MaxTPID do
          if tpi in ConflSet then precede(tp, tpi);
        GrantReq(tp)
      end
  end
end;


{-------------------- RUN CC REQUESTS --------------------}

procedure CCprocess;
var ConflSet: TSet;
begin
  { repeatedly process requests while RunSet is non-empty }
  while RunSet <> [] do begin
    while not (CurrentTP in RunSet) do
      if (CurrentTP < MaxTPID)
      then CurrentTP := succ(CurrentTP)
      else CurrentTP := MinTPID;

    with trans[CurrentTP] do begin
      { determine set of possibly conflicting transactions }
      case access of
        R:  ConflSet := WSet[ObjID] + WPSet[ObjID];
        RW: ConflSet := RSet[ObjID] + RPSet[ObjID] +
                        WSet[ObjID] + WPSet[ObjID];
        W:  ConflSet := RSet[ObjID] + RPSet[ObjID];
        V:  ConflSet := PrecedeSet end;

      { current transaction can't conflict with itself }
      ConflSet := ConflSet - [CurrentTP];

      { now process the request }
      if ConflSet = []
      then begin { process here to save some time }
        if access = V
        then validate(CurrentTP)
        else GrantReq(CurrentTP)
      end
      else process(CurrentTP, ConflSet);

      { finished now with current transaction }
      RunSet := RunSet - [CurrentTP]
    end
  end
end;


{-------------------- CC PROCEDURES --------------------}

{ Suppose it was desired to use a timestamp-based policy. Then }
{ transaction records could be extended to include a time field, }
{ and "trans[tp].time := <current time>" could be added to the }
{ following procedure. Similar modifications could be made for }
{ any other policy based on information available at Cbegins. }

procedure CCbegin(tp: TPID);
begin
  trans[tp].status := active
end;

procedure CCvalid(tp: TPID);
begin
  with trans[tp] do begin
    if status = aborted
    then SendResult(tp, false)
    else begin
      access := V;
      RunSet := RunSet + [tp];
      CCprocess
    end
  end
end;

procedure CCend(tp: TPID);
begin
  complete(tp);
  CCprocess
end;

procedure CCabort(tp: TPID);
begin
  abort(tp);
  CCprocess
end;


procedure CCread(tp: TPID; id: OID);
begin
  with trans[tp] do begin
    if status = aborted
    then SendResult(tp, false)
    else if tp in RSet[id]
    then SendResult(tp, true)
    else
    begin
      access := R;
      ObjID := id;
      RunSet := RunSet + [tp];
      CCprocess
    end
  end
end;

procedure CCwrite(tp: TPID; id: OID);
begin
  with trans[tp] do begin
    if status = aborted
    then SendResult(tp, false)
    else if tp in WSet[id]
    then SendResult(tp, true)
    else
    begin
      if tp in RSet[id]
      then access := W
      else access := RW;
      ObjID := id;
      RunSet := RunSet + [tp];
      CCprocess
    end
  end
end;


procedure look(tp: TPID);
var id: OID;

  procedure WriteTSet(s: TSet);
  var tpi: TPID;
  begin
    for tpi := MinTPID to MaxTPID do
      if tpi in s then write(tpi)
  end;

begin
  writeln('TRANSACTION ', tp);
  with trans[tp] do begin
    write(' Status: ');
    case status of active: writeln('active');
      pending: writeln('pending'); validated: writeln('validated');
      aborted: writeln('aborted'); completed: writeln('completed') end;
    write(' Access: ');
    case access of none: writeln('none'); R: writeln('R');
      RW: writeln('RW'); W: writeln('W'); V: writeln('V') end;
    writeln(' Object ID: ', ObjID);
    writeln(' WaitCount: ', WaitCount:1);
    write(' PrecedeSet: '); WriteTSet(PrecedeSet); writeln;
    write(' VwaitSet: '); WriteTSet(VwaitSet); writeln;
    write(' CwaitSet: '); WriteTSet(CwaitSet); writeln;
    write(' ReferSet: '); WriteTSet(ReferSet); writeln;
    write(' ObjSet: ');
    for id := MinOID to MaxOID do
      if id in ObjSet then write(id);
    writeln
  end
end;


procedure driver;
var m: MsgType; tp: TPID; id: OID;
    Rop, RWop, Wop, Vop: DecType; RWSubOp: SubOpType;
begin
  repeat
    GetMsg(m, tp, id, Rop, RWop, Wop, Vop, RWSubOp);
    case m of
      Cbegin:  CCbegin(tp);
      Cread:   CCread(tp, id);
      Cwrite:  CCwrite(tp, id);
      Cvalid:  CCvalid(tp);
      Cend:    CCend(tp);
      Cabort:  CCabort(tp);
      Cpolicy: PMpolicy(Rop, RWop, Wop, Vop, RWSubOp);
      Clook:   look(tp);
      Cquit:   end
  until m = Cquit
end;

begin
  termin(input); termout(output);          { initialize terminal I/O }
  init;
  PMpolicy(wait, wait, wait, wait, all);   { default to locking policy }
  writeln('(default locking policy in effect)');
  driver
end.


*  >  *.  *¥&*!  if  J 

ObJS*:  A 

M  ******* 

mmsenm  4 

kMH  Mill 
iiimi  m 
otjmt  a:  a 
llltOMM:  > 

VtaltSMt 
AMitfKt 
AMni  US 
OAJtMt  A 

fl  ******* 

A— H>  ftwTFIi  VAUam 
mm  111  HI  «•  1*1:  <K.  V  •  1 

•t  ******* 

UMifi  (mini  M 

mm  Mm«i  w  1ft:  OK 
— »  Khm|I  «•  tVSi  01 
(mur  wnD 
it 

Ami I>|  fm  fPti  liUMOS 
mm  AUIH>  U  tfti  01,  IN  •  2 

«s  ******* 

Ami  hi  «m1VSi  MUMS 
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